Foreword


I feel truly honoured and pleased to have received an invitation from Professor
Koutsoyiannis to write a foreword to his fascinating scientific book entitled Stochastics of
Hydroclimatic Extremes: A Cool Look at Risk. As a matter of fact, Professor Koutsoyiannis
and I do not always agree on every aspect of the science of climate change and
its impact, or on the justification behind mitigation and adaptation. However, we
are both doing our best to heed the story that the raw observation data are telling us,
without forcing them to say what they are expected to say. The book is in this spirit, taking
the reader on a guided magical mystery tour of objective and rational methods, free of
ideology and preconceptions.

The book opens the door to the thoughts of Professor Koutsoyiannis that deserve to be 
broadly known. He is an established scientist with a respectable publication track record. He
authored or co-authored many journal papers that have attracted considerable attention 
and multiple citations. Now, when the book is available, the scientific community can 
conveniently access the findings reported in his seminal works in one place, instead of 
having to refer to many journal papers. There are no restrictions in the book that are 
usually imposed on journal articles, such as the word count or the need to bow to 
recommendations by reviewers and editors. The author of a book is free to shape the 
contents as he wants. Essential is that, on the one hand, the book must be scientifically 
sound and rigorous, but on the other hand, it must be interesting, so that the reader does 
not give up and walk away. In my opinion, these conditions are convincingly met by this 
volume. In his works before writing the book, Demetris has contributed to each and every 
subarea covered by the book. He reports on his own experience. 

Professor Koutsoyiannis is a prolific writer, but some of his excellent papers,
challenging conventional wisdom, were rejected by established journals, so that they
are available on the author’s web portal. Possibly, some of them conveyed an inconvenient
truth. At times, rejection decisions were based on superficial, or simply wrong, reviews. I
witnessed one disappointing publication attempt first hand, sharing with Demetris the
misfortune of having a joint paper rejected on the basis of two unfair, arrogant reviews.
Demetris promotes eponymous reviewing, in which the authors’ identity is disclosed to the
reviewers and the reviewers’ identity is disclosed to the authors. The symmetry of such
an arrangement improves responsibility. I am proud to state that I have co-authored
a few journal papers with Koutsoyiannis.

Professor Koutsoyiannis is fascinated by the overwhelming wealth of data available 
nowadays in the public domain, in our brave new world. Large sets of real observation data
can be freely accessed. So, the reader is welcome to help oneself to the data and to try 
exploratory data analysis on one’s own, to search for a pattern. 

I can imagine that many readers will go through the whole book, possibly skipping the 
masses of equations present in some chapters. For instance, there are 248 numbered 
equations in Chapter 6 and 125 in Chapter 2. However, these equations are needed for 
those readers who wish to undertake a detailed study of selected parts of the book that
deal with the material of relevance to the particular problem at hand. 

Professor Koutsoyiannis is a genuine ambassador of Hellenism. The book draws
extensively on ancient Greek thought, philosophy and poetry. It is fascinating
to observe how he explains the Greek roots of words that everybody knows, although not everyone
is aware of their Greek origin. He teaches us his interpretation of the very term
stochastics, which plays the central role in the book and in its title. This essential notion is
derived from Greek roots, but has been broadly used in a different way. Demetris has
proposed new names, originating from Greek, to baptize scientific constructs, such as
“climacogram” and “ombrian”. They are indeed better justified than the existing terms that
are already in circulation, but it is clear from Google search counts that they are
not yet as well known.

We live in the era of bibliometric indices being used as the principal, parametric, 
measure of scientific achievements of an individual scientist or a scientific institution (at 
times, even an entire country). Indeed, nowadays, citation count and Hirsch index are the 
currencies in which scientists are evaluated. Hence, a book is not a product that gets 
adequately rewarded by the bibliometric indices. So, in a way, writing a book is a sacrifice 
for the author in comparison to publication of articles in leading peer-reviewed journals 
from the top quartile of a disciplinary division of the Web of Science list, with a respectable
value of the impact factor. The very terms “bibliometry” and “bibliometric” are clearly of
Greek origin, stemming from two words: βιβλίον (biblion) meaning a book, and the verb
μετριάω / μετρέω (metriao / metreo) meaning to measure. By the way, even if the terms
bibliometry or bibliometric refer to books by construction, in the real world they now mostly
refer to journal articles rather than books or book chapters.

The book reads really well. It contains numerous illustrations (125). There is also a
wealth of interesting digressions and appendices, and, finally, an excellent, truly
international and multi-lingual, list of references, including little known works of Soviet 
or Russian scientists. 

In my view, this is the best book ever published in this area, successfully competing 
with other recognized giants. I might consider this book as a candidate for a short list of
books I would pack in my luggage for a visit to an uninhabited island, where I would have
much time to study it over and over again. Forty-five years ago, I would have taken a handbook
entitled Probability, Random Variables and Stochastic Processes, written by another
scientist of Greek origin, Athanassios Papoulis. Now, I would swap Papoulis for
Koutsoyiannis. One could rightly ask what would be the sense of taking such a book to an
uninhabited island, where the concepts of extremes, probability, statistics and stochastic
processes are of little practical relevance. Well, there is an internal beauty in the theory
expounded by Demetris. There are ample illuminating examples, in particular related to
hydroclimatic processes and extremes. Considerable time is needed to study this volume 
in detail, especially coming back to the bits and pieces that were skipped during the first 
pass. I find the book enriching and I am really confident that it would be enriching to any 
readership. 


I am pretty sure that our common friend, the late Vít Klemeš, would applaud this book,
greeting Demetris now from his cloud #17. This is how Vít projected his eternal residence
address, even if cloud numbering, assuming some stability, is more a construct of poetry
than a climatologically-justified notion. I see spiritual similarities between Vít and
Demetris. Indeed, Koutsoyiannis is using pins against balloons in Klemeš’s style.

There are ample references to the return period in the book, so I wish the readers
have many happy returns to the book. I am sure that the return period will be finite. Once 
in, again in. 

In Greek mythology, it was believed that drinking from the Pierian Spring of 
Macedonia, sacred to the Muses, would bring great knowledge and inspiration. I wish the 
readership to enjoy drinking from the Pierian Spring of Koutsoyiannis. 


Zbigniew W. Kundzewicz 


Corresponding Member of Polish Academy of Sciences 
Member of Academia Europaea 


Prolegomena 


In 2005, my colleague and friend at the U.S. Geological Survey, the late Timothy A. Cohn, 
and I began looking at the inherent weakness in standard approaches to testing the 
statistical significance of hydroclimatic trends. The subject was of interest because both 
of us had used trend tests in our investigations of discharge and water quality time series 
and realized that although trend magnitude was easily determined with little ambiguity, 
the corresponding statistical significance was less certain because significance depended 
critically on the null hypothesis. The latter, of course, reflects subjective assumptions 
about the underlying stochastic process. Our curiosity was fostered by an awareness that 
the standard approaches to significance testing of hydroclimatic time series were all 
based on the assumption of independence. 

We knew from the work of Harold Edwin Hurst, subsequently discussed by
Mandelbrot and Wallis, Klemeš, and Hosking, among others, that hydroclimatic records are
realizations of physical processes whose behaviour exhibits long-term persistence (LTP). 
Such behaviour was sometimes modelled as fractional Gaussian noise (fGn) or fractionally 
differenced ARIMA processes. Importantly, LTP is a stationary process. Our specific 
interest, however, was not in evaluating LTP, but rather in exploring what effects LTP had, 
if present, on the significance of observed trends. What we found was an effect that was 
much more noteworthy than we had imagined. In looking at a nearly 150-year record of 
northern hemisphere temperature, we found that the standard test of significance, which 
assumes no LTP, yielded a highly significant increasing trend with p-value of 1.8 × 10⁻²⁷.
We then applied a test which assumed the presence of LTP and found an increasing trend
with p-value of 7.1 × 10⁻², i.e., a trend not significant at the p = 0.05 level. In changing from
one test to another, 25 orders of magnitude of significance vanished. This result was and 
remains somewhat troubling given the uncertainty about the stochastic process and the 
possibility that the observed temperature warming over the past 150 years could be 
explained by natural dynamics in the form of long-term persistence and non-linearity in 
the climate system. 

During our initial literature review on long-term persistence, we noticed a number of 
very recent papers by a Greek hydrologist named Demetris Koutsoyiannis whom neither 
of us knew. After digesting and discussing these papers, it became clear that Demetris had 
an unusually profound understanding of stochastics, and particularly its relevance to the 
field of hydroclimatology. Accordingly, we quickly contacted him to learn more about his 
work. We also sent him a draft of our paper and solicited his thoughts and comments. 
Demetris’ comments were astute, yet humble, and conveyed a clarity in his understanding 
of this subject that went well beyond what we had seen described by others. His 
suggestions substantially improved the manuscript, even going so far as to suggest a more
meaningful, less contentious, yet still provocative title... Nature’s Style: Naturally Trendy. 
The paper was published in Geophysical Research Letters shortly thereafter with minor 
revisions. 
That initial contact opened a dialog that has continued to this day and dramatically
expanded our perspective on stochastic hydrology. Along the way, we have witnessed an 
evolution in Demetris’ own thinking on the subject; from the more conventional 
understanding of stochastics, wherein the notions of stationarity and nonstationarity are 
defined; to his innovative articulation of Hurst-Kolmogorov dynamics, which is (1) 
stationary and demonstrates how stationarity can coexist with change at all time scales, 
(2) linear, thus emphasizing the fact that stochastic dynamics need not be nonlinear to 
produce realistic trajectories, (3) simple, parsimonious, and inexpensive, and (4) 
transparent, because it does not hide uncertainty nor pretend to predict the distant future 
deterministically; to his theoretical development of stochastics in defining moments for 
use in assessing hydroclimatic extremes, a major focus of this book. 

A consummate teacher, Demetris always presents his theses with precision, logic and 
imagination. Readers will find themselves following the concept of stochastics from its 
original usage by classical Greek philosophers to its modern formulation and application 
in hydroclimatology. Its applicability, however, extends to all areas of geophysics. This 
book incorporates the contributions of many of the most influential researchers in 
stochastic hydrology over the past half-century, with a significant number of those 
contributions coming from the author himself, as well as his students and collaborators. 
It is the pinnacle of nearly two decades of scholarship from someone who has become 
recognized as the leading and most influential voice for stochastics among modern 
hydrologists, joining a very select circle of late 20th century scholars who influenced his 
early work. 

Stochastics of Hydroclimatic Extremes: A Cool Look at Risk is the single most 
authoritative discourse on the theory and application of stochastics from a geophysical 
perspective available to any interested scholar. It is an essential resource in any serious 
stochastic hydrologist’s library, and an incomparable reference for every advanced 
student in hydrology. It will undoubtedly become the standard reference on stochastic 
hydrology for decades to come. 


Harry F. Lins 


Past-President, Commission for Hydrology, World Meteorological Organization 
U.S. Geological Survey, Retired 
But listen to the tale

Of human sufferings, and how at first 

Senseless as beasts I gave men sense, possessed them 

Of mind. I speak not in contempt of man; 

I do but tell of good gifts I conferred. 

In the beginning, seeing they saw amiss, 

And hearing heard not, but, like phantoms huddled 

In dreams, the perplexed story of their days 

Confounded; knowing neither timber-work 

Nor brick-built dwellings basking in the light, 

But dug for themselves holes, wherein like ants, 

That hardly may contend against a breath, 

They dwelt in burrows of their unsunned caves. 

Neither of winter's cold had they fix'd sign, 

Nor of the spring when she comes decked with flowers,

Nor yet of summer's heat with melting fruits 

Sure token: but utterly without knowledge 

Moiled, until I the rising of the stars 

Showed them, and when they set, though much obscure.

Moreover, number, the most excellent 

Of all inventions, I for them devised, 

And gave them writing that retaineth all, 

The serviceable mother of the Muse. 

I was the first that yoked unmanaged beasts, 

To serve as slaves with collar and with pack, 

And take upon themselves, to man's relief, 

The heaviest labour of his hands: and I 

Tamed to the rein and drove in wheeled cars 

The horse, of sumptuous pride the ornament. 

And those sea-wanderers with the wings of cloth, 

The shipman's waggons, none but me devised. 

These manifold inventions for mankind 

I perfected, who, out upon't, have none,— 

No, not one shift—to rid me of this shame. 


Aeschylus, Prometheus Bound (442-471), 
Translated by G. M. Cookson†


* http://www.greek-language.gr/digitalResources/ancient_greek/library/browse.html?page=12&text_id=132

† https://en.wikisource.org/wiki/Four_Plays_of_Aeschylus_(Cookson)/Prometheus_Bound


Preface 


A year ago, a flash flood claimed the lives of 24 people in Mandra, a small town near
Athens*. The losses resulted from a lack of flood-protection infrastructure, while the
natural stream network had been abused by urban development. If the storm had been
predicted and if alert systems and evacuation plans had been in place, the consequences
would not have been so tragic. However, predictions for storms of small duration and extent,
occurring at dry places, are difficult. This year some meteorologists in Greece (not the
official meteorological service), perhaps envying the glory of American meteorologists
who deal with storms of a different type, such as hurricanes, and at the same time facing
the fact that in Greece there are no hurricanes, decided to give names to every
meteorological depression entering Greece. The journalists received this initiative
enthusiastically, advertising the names in all media, while authorities started to react by
closing schools on days of predicted (named) bad weather. On the very day I am writing
these lines, the weather in Athens is wintry (as it should normally be in February). In
preceding days, meteorological predictions spoke of an unprecedented, “historical snow
event” (ιστορικός χιονιάς)†. But, to the forecasters’ disappointment, this so-called
historical snowfall was, once again, not to come about.

If meteorological predictions are difficult, especially those for a week ahead, what about
climate predictions, which are for really long time horizons? A few years ago, it was
predicted that “snowfalls are now just a thing of the past”.‡ Soon this prediction changed to
the opposite one, “Extreme snowfall is actually an expected consequence of a warmer
world”.§ However, despite their variety, reaching self-contradiction, all these predictions
have some things in common. For most people, they are scary. And in contrast to
Cassandra’s sorrowing prophecies, which were true but not believed by people, current
prophecies of catastrophes usually are widely believed but very often do not come true.

Apocalyptic prophecies have been common in history and were mostly connected to
religion, to which they owed their power. Modern prophecies are instead connected to the
ideology of environmentalism and owe their power to scientists. However, they share
several characteristics with prophecies of old; most prominently, the scare-mongering
and world-saviour attitudes. Since 1970, several environmental scientists have predicted lots
of catastrophes, at which apparently God laughed and, as they did not come to pass, we
too may laugh now¶. I believe that bombarding people with negative predictions is
detrimental to society and is objectively contrary to these world-saving pretences.
It makes society more vulnerable. This has been vividly expressed more than 2600





* https://en.wikipedia.org/wiki/Mandra
† https://tvxs.gr/news/ellada/erxetai-istorikos-xionias
‡ https://web.archive.org/web/20150905124331/http://www.independent.co.uk/environment/snowfalls-are-now-just-a-thing-of-the-past-724017.html
§ http://www.bbc.com/earth/story/20160127-will-snow-become-a-thing-of-the-past-as-the-climate-warms
¶ Koutsoyiannis, D., Saving the world from climate threats vs. dispelling climate myths and fears, Invited
Seminar, Lunz am See, Austria, doi:10.13140/RG.2.2.34278.42565, WasserCluster Lunz - Biologische
Station GmbH, 2017.
years ago in Aesop’s fable originally entitled “shepherd playing” («Ποιμὴν παίζων»),
better known in English as “the boy who cried wolf.”* More than 2300 years ago, Epicurus 
pronounced science as the enemy of fear and of superstition. And a couple of centuries 
before Epicurus, other philosophers such as Plato and Aristotle clarified the meaning and 
the ethical value of science as the pursuit of the truth—pursuit that is not driven by 
political agendas and economic interests. For the latter, they used the term sophistry. 

I believe what is needed is a cool look at risk. For risk exists—as it existed all the time 
in the past and will certainly exist in the future. Because of the rapid growth of population 
in the 20th century, increasing by an order of magnitude since 1800 and two orders of
magnitude since the era of Plato and Aristotle, and becoming now a significant percentage 
(>7%) of the people that have ever lived on Earth, one would think that the risk, measured 
in terms of damages and human losses due to natural hazards, has increased. This 
however is not the case. Thanks to substantial progress in engineering and technology the 
risk has decreased. 

Engineers’ profession is tightly connected to risk. The infrastructure they build 
generally decreases risk from natural hazards but does not eliminate it. At the same time, 
infrastructure is subject to risk per se. The comedian and writer John Oliver gave it an 
interesting definition: “Infrastructure: it’s our roads, bridges, dams, levees, airports, power
grids, basically anything that can be destroyed in an action movie.”‡ Accordingly, the
engineers’ profession is socially sensitive and responsible to an enormous degree. Unlike
Aesop’s shepherd, an engineer cannot play with risk; the consequences, in case of a failure
of infrastructure or its management, are not as ecologically friendly as wolves eating 
sheep. 

Being an engineer, I have dealt with risk for decades. It is my intent to convey my 
experience to the readers of this book. Although I have published a lot of articles and given
even more conference talks related to this subject, what is contained in the book is mostly
new. 

One important issue that I have consistently tried to communicate is my belief that the 
current standard methodologies underestimate substantially the probability of extreme 
events. I hope I have substantiated my claims in this book. The reasons for this underestimation
are basically two. The first is an inappropriate assumption of classical statistical 
methodologies: that the different events are independent of each other. They are not. This 
will repeatedly be illustrated in the book using long records of hydrometeorological 
processes, as well as invoking theoretical arguments. The second is the assumption that 
the distribution tails of those processes exhibit a rapid decay as we go to larger and larger 
values: an exponential descent, as in the exponential or even the normal distribution.
The inappropriateness of both these assumptions has not been widely known because the 
relevant behaviours are hidden if the time series of observations are not long enough. And 





* https://en.wikisource.org/wiki/The_Shepherd%27s_Boy_and_the_Wolf; original Greek text:
https://el.wikisource.org/wiki/Αισώπου_Μύθοι/Ποιμὴν_παίζων.


† Related data are given in the last chapter of the book.
‡ Infrastructure: Last Week Tonight https://www.youtube.com/watch?v=Wpzvaqypav8.
both assumptions are connected to each other and act synergistically to underestimate
the probability of occurrence of extremes and hence the risk. 

In particular, the independence assumption is virtually equivalent to a static climate. 
Accordingly, if we remove this assumption, we get a varying climate, which is consistent 
with the real-world climate. These statements may sound counterintuitive or even wrong, 
because typically dependence is interpreted as memory rather than change. Nonetheless, 
the close relationship of dependence, particularly the long-range one, with change is 
illustrated in the book both empirically and theoretically. Given, on the one hand, the 
adherence to independence in typical studies of extremes and, on the other hand, the fact 
that independence entails a static climate, it is not surprising that most recent studies try 
to remedy the consequences of the inappropriate assumptions by invoking climate 
change—or anthropogenic global warming, the global scapegoat of our era. Methods to 
embed climate change into studies dealing with occurrence probability vary, but all have 
several weaknesses—examples are provided in the book. I believe that just removing the 
independence assumption—and thus representing a changing climate without additional 
assumptions—resolves most of the underestimation problems. 
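
The link between dependence and a changing climate can be made concrete with a minimal sketch, offered as an illustration rather than as the method developed in the book. It assumes the Hurst-Kolmogorov scaling law, in which the variance of the process averaged over a time scale of k steps is σ²/k^(2−2H), with H = 0.5 corresponding to independence. The function name `scale_std` and the value H = 0.8 are hypothetical choices made only for this example.

```python
# Illustrative sketch. Assumptions: Hurst-Kolmogorov scaling of the variance
# with time scale; H = 0.8 is a hypothetical value, not an estimate from data.

def scale_std(sigma: float, k: int, hurst: float) -> float:
    """Standard deviation of the process averaged over k time steps.

    Under the assumed scaling law the variance at scale k is
    sigma**2 / k**(2 - 2*hurst); hurst = 0.5 recovers independence.
    """
    if k < 1:
        raise ValueError("time scale k must be a positive integer")
    if not 0.0 < hurst < 1.0:
        raise ValueError("Hurst parameter must lie in (0, 1)")
    return sigma / k ** (1.0 - hurst)


sigma = 1.0  # standard deviation at the basic time scale (arbitrary units)
for k in (1, 10, 30, 100):
    independent = scale_std(sigma, k, 0.5)  # static-climate case
    persistent = scale_std(sigma, k, 0.8)   # long-range dependence (hypothetical H)
    print(f"k = {k:3d}  independence: {independent:.3f}  H = 0.8: {persistent:.3f}")
```

At a scale of 30 steps (30 years, if the step is a year), the averaged process under independence retains about 0.18 of the original standard deviation, whereas under the assumed H = 0.8 it retains about 0.51. In this sense independence flattens the long-term averages into a nearly static climate, while dependence leaves them free to vary.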

The language used in this book is the language of stochastics. This may be 
difficult to apprehend at first glance, but it is an effective language. The book tries to adhere to
the rigorous use of stochastics, on the one hand, and to make its presentation both easy 
and self-contained, on the other hand. In this respect, the biggest part of the book is 
devoted to the theory of stochastics which is necessary for inferences about extremes. 
Stochastics is a scientific area broader than statistics—actually, according to the 
definition I adopt, statistics is part of stochastics. Another part is the theory of stochastic 
processes, in which time has a hypostasis that is typically absent in statistics. The direct 
analogy is dynamics vs. statics. This does not mean that statistics is underrepresented in 
the book. On the contrary, several new developments are presented—most notably the 
new tool of knowable moments, which have two relevant characteristics: they are closely 
connected to extremes and their estimation is unbiased in the framework of classical 
statistics or involves small bias in stochastic processes with dependence in time, whilst 
the bias in the estimation of classical statistical moments can be huge. As will be seen in
the book, knowable moments help to develop an extreme-oriented fitting methodology of 
probability distributions. 

In parallel to being theoretical, the book is oriented to application. The new theoretical 
developments are supported by derivations and proofs, which to improve readability are 
contained in a number of Appendices in each chapter. The application is supported by 
several examples and illustrations, usually standing out as parenthetical sections or 
Digressions, as well as by tabulations of mathematical formulae that are used for each 
task. 


Athens, 24 February 2019 


Demetris Koutsoyiannis 
Notational conventions


The book follows the Guidelines for the use of units, symbols and equations in hydrology*. In turn, these
guidelines are based on (i) the Système International (SI) brochure†; (ii) the ISO 80000-2 Standard,
Mathematical Signs and Symbols to Be Used in the Natural Sciences and Technology; and (iii) Unicode
Technical Report #25, Unicode Support for Mathematics.‡ We list some of the conventions here for the
reader’s convenience.


Physical dimensions and units

(a) All quantities are dimensionally consistent. In particular, arguments of functions such as exp( ) and
ln( ) are dimensionless.

(b) We use s, min, h, and d for second, minute, hour and day respectively. We do not abbreviate week,
month or year, which are non-SI units.§

(c) Multiplication of units is indicated by a space, e.g. N m, and division either by negative exponents (e.g.
m s⁻²) or by use of the solidus (oblique line, e.g. m/s²); however repeated use of the solidus (e.g. m/s/s)
is not permitted.

(d) Prefixes of units such as M (mega = 10⁶) and μ (micro = 10⁻⁶) have no space between (e.g. μs, MW).
According to the SI, the prefix for kilo is lower case k (e.g. km; K is the symbol of the kelvin).

(e) For areas and volumes, we use m² and m³; the hectare (ha) and the litre (L) are also allowed in SI. A
million m² is denoted as square kilometre (1 km² = 10⁶ m²). A million m³ is denoted as cubic hectometre
(1 hm³ = 10⁶ m³, not 1 Mm³ because 1 Mm³ = 10¹⁸ m³; note that in SI any power to a unit applies also
to the prefix); a billion m³ is denoted a cubic kilometre (1 km³ = 10⁹ m³).

(f) All units are typeset in upright (Roman) fonts, not italic or bold.

(g) Numerals are also typeset in upright fonts. The symbol for the decimal marker is the dot. To facilitate
reading, numbers are divided in groups of three using a thin space (e.g. 12 345.6). (Note that neither
dots nor commas are permitted as group separators.) A space is used to separate the unit from the
number (e.g. 10 m).


Symbols and equations

(a) We prefer single-letter variables (if necessary, with subscripts, e.g. Erms) over multi-letter ones. Single-letter
variables or parameters and user-defined function symbols are italic (e.g. x, Y, B, f(x)). Multi-letter
variables, if they cannot be avoided, are typeset in upright, not italic (e.g. RMSE).

(b) Common, explicitly defined, functions are not italic, whether their symbols are single-letter (e.g. Γ(x)
for the gamma function, B(y, z) for the beta function) or multi-letter (e.g. ln x, exp(x + y)).

(c) Textual subscripts or superscripts are not italic (e.g. xmax, Tmin, where ‘max’ and ‘min’ stand for
maximum and minimum, respectively).

(d) Mathematical constants are upright (e.g. e = 2.718..., π = 3.141..., i² = −1). Also, mathematical operators
are upright (e.g. dx in integrals and derivatives, Δy for the difference operator on y).

(e) Vectors, matrices and vector functions are bold and, for single-letter variables, italic. In particular,
vectors are usually denoted with lower case letters (e.g. x, w as vectors; f(x) as a vector function of a
vector variable) and matrices with upper case letters (e.g. A as matrix; AB as the product of matrices A
and B, Aᵀ as the transpose of A, det A as the determinant of a square matrix A).

(f) We use nested parentheses for grouping (e.g. ln(a (b + c)) rather than ln[a (b + c)]).

(g) To distinguish stochastic variables from regular variables we use the Dutch convention¶, i.e.,
we underline the stochastic variables. Further, we use the curly brackets for sets (e.g. P{x̲ ≤ x} for a
scalar x or P{x̲ ≤ x} for a vector x; note that the argument of probability (P) is a set, not a number).

(h) We use square brackets for expectations, variances and other operators on stochastic variables (e.g.
E[x̲], var[x̲], cov[x̲, z̲]; note that E[x̲] is not a function of x̲ and thus it should not be denoted as E(x̲)).

(i) Definitions by mathematical equations are denoted using the symbols ‘:=’ and ‘=:’ (e.g. to define c as
the sum of a and b we write c := a + b or a + b =: c).


* Prepared by D. Koutsoyiannis and H.H.G. Savenije, 2013, doi: 10.13140/RG.2.2.10775.21922
† Ninth edition, http://www.bipm.org/en/si/si_brochure/
‡ http://www.unicode.org/reports/tr25
§ We avoid ‘a’ for year, because in SI ‘a’ is the prefix atto, meaning 10⁻¹⁸; also it is the symbol of an ‘are’, a
non-SI unit whose multiple hectare is accepted in SI (1 a = 100 m²; 1 ha = 100 a = 10⁴ m² = 1 hm²).
¶ Hemelrijk, J., 1966. Underlining random variables. Statistica Neerlandica, 20(1), pp. 1-7.
Main use of single-letter symbols




coefficients of stochastic generators 


(as a standard, the beta function B(, )) 
autocovariance 


(as a standard, the differential operation d) 
time unit, discretization time step 
(as constant, e = 2.71828...) 


probability density function 
probability distribution function 


time lag, dimensional 
Hurst parameter (also, H_p := Σ_{i=1}^{p} 1/i and
H_p^(a) := Σ_{i=1}^{p} 1/i^a are the pth harmonic
numbers of orders 1 and a, respectively)


time scale, dimensional 
K-moment 


Length of observation period 
moment 

Mandelbrot parameter 

size of sample or vector 

size of sample or vector 
moment order 

probability 

moment order 

correlation coefficient 
power spectrum 

time, dimensional 

return period (as a superscript, ‘T’: transpose)


white noise process 


frequency 




x,y,z stochastic variables and process or time series w 
X,Y,Z cumulative stochastic process or time series 


Q 


time scale parameter in stochastic processes 
as Latin A 

Background measure 

as Latin B 

climacogram (as a standard, Euler’s constant,
γ = 0.577216...)

cumulative climacogram (as a standard, the
gamma function Γ( ) and the incomplete
gamma function Γa( )).


dimensionless location parameter in 
distributions 


dimensionless shape parameter (lower-tail
index) in distributions (as a standard, the
Riemann zeta function ζ( ))

as Latin Z 

time lag, dimensionless 

as Latin H 

angle (phase); also ombrian parameter 
Bias correction factor 


as Latin I 

time scale, dimensionless (also cumulants) 
as Latin K 

state scale parameter in distributions 
Λ-coefficient

mean, moment 

as Latin M 

similar to n (size of sample or vector) 
as Latin N 

dimensionless shape parameter in 
distributions (tail index) 


as Latin O 
(as constant, π = 3.14159...)


standardized cross-climacogram 
as Latin P 

standard deviation 

sum 

time, dimensionless 

as Latin T 

structure function 

as Latin Y 

entropy production 

Entropy 


as Latin X 
climacospectrum 


frequency, dimensionless 


Chapter 1. An introduction by examples 


1.1 General setting 


We will start our journey to the hydroclimatic extremes with some illustrative examples. 
The purpose is to recognize the physical behaviours before we start discussing the 
mathematical and technical weaponry to tackle the problems about the risk related to the 
occurrence of extremes. In particular, by studying these examples we may understand 
how hard (perhaps infeasible) it is (and most probably will always be) to deal with extremes
using deterministic methods, while at the same time the theory of stochastics provides 
suitable means to quantify the extremes and the related risks. Generally speaking, 
deterministic approaches are popular as they are simple to understand and our education 
system is based on them, but methods offered by stochastics are much more powerful. 

An interesting example of a spectacular failure of a deterministic approach in 
quantifying extremes is the notion of the so-called probable maximum precipitation, 
which is regarded to be an upper bound of precipitation that is physically feasible. Such 
an upper bound is philosophically and scientifically inconsistent. Moreover, the methods 
devised to determine it, even though they are thought to be physically-based
deterministic methods, are in fact statistical methods using bad statistics. Therefore, we
will not consider this type of approach, while the reader interested in the reasoning
about the inconsistency of these methods is referred to Koutsoyiannis (2007) and
Koutsoyiannis and Papalexiou (2017).

But what is stochastics, the term appearing also in the title of the book? In the modern 
scientific vocabulary, it is used to collectively refer to (a) probability theory, (b) statistics 
and (c) stochastic processes. More loosely speaking, stochastics is the mathematics of 
stochastic variables and stochastic processes, which will formally be introduced in 
Chapter 2 and Chapter 3. However, the notion of stochastics, long before being introduced
into the scientific vocabulary by Jacob Bernoulli, had originally appeared in ancient Greek
philosophical texts. These appearances both enrich and elucidate the notion of stochastics 
and it is thus useful to trace back its different meanings through the history of philosophy 
and science. Relevant information is contained in Digression 1.A, which helps us to 
perceive the meaning of a stochastic approach, a rich meaning with several facets, 
including: 


• probability theoretic;
• capable of quantifying the non-precise, the uncertain, or else the random;
• insightful, not superficial;
• aiming at prediction in a probabilistic sense, using information of the past;
• aiming at calculating the mean, or expectation, of uncertain quantities.


Naturally, once we have adopted a stochastic approach, we will deal with probabilities 
and expectations of extreme quantities, and our inferences will be based on past 
observations of the processes of interest. Thus, the examples below make use of the 
available information to make inferences of probabilistic type. But before we make
inferences of quantitative type, we need to (a) identify the most important characteristics
of the process behaviour and (b) assume a model consistent with this behaviour. The 
examples discuss three types of models, namely (1) the classical probabilistic model 
according to which the different events are independent of each other, (2) a linear trend 
model and (3) a model assuming a certain type of dependence in time. By comparing these 
three models with the help of the examples, we will form a general guide with directions 
that we should follow in studying extremes, which are also the directions underlying the 
next chapters of this book. 


Digression 1.A: The meaning of stochastics 


Literally, stochastics is a term of Greek origin, stemming from the adjective ‘stochasticos’
(στοχαστικός), or better its feminine gender, ‘stochastice’ (στοχαστική). It is generated from the
verb ‘stochazesthai’ (στοχάζεσθαι), which in turn comes from the noun ‘stochos’ (στόχος),
meaning the target.

Aristotle, in his treatise Nicomachean Ethics (written ~350 BC) uses the term stochastice in its 
original meaning, related to the target, which, according to him, is the mean: “virtue, therefore, is 
a balance [‘mesotes’], in the sense that it is able to hit [as a target - ‘stochos’] the mean”.¹
Furthermore, in his treatise Rhetoric he uses the term with a metaphorical meaning, which could
be translated into English as guessing or guesswork: “men have a sufficient natural instinct for what
is true, and usually do arrive at the truth. Hence the man who makes a good guess at truth is likely
to make a good guess at probabilities [stochastically].”²

However, it was Plato who used the term with a meaning closer to the modern one, i.e., related 
to uncertainty. In his dialogue Philebus (written ~360 BC) he contrasts “arithmetic and the 
sciences of measurement” to stochastics and parallels the latter with music, which “attains harmony
by guesswork [...] so that the amount of uncertainty mixed up in it is great, and the amount of
certainty small.”³

The contrast between stochasticity and precision is made clear later by Galenus using the 
example of a city’s clock: “When a city is being built, let the following problem be set before those 
who will inhabit it: they want to expertly know, not stochastically but precisely, on an everyday basis, 
how much time has passed, and how much is left before sunset.”⁴

The connection of stochastics with prediction or forecast becomes evident in an excerpt from 
Basilius Caesariensis who contrasts a prophet to a ‘stochastes’ (στοχαστής, a noun usually
translated into English as diviner): “On the one hand, a prophet is he who foretells the future by
revelation of the Spirit; on the other hand, a stochastes is he who infers the future by prudence,
comparing similar states, and by the experience of forefathers.”⁵ It seems that this comment has
influenced later scholars (e.g. Procopius) and perhaps determined the meaning of stochastic in 
modern Greek, which is imaginative, insightful, thoughtful, cogitative, contemplative, meditative. 

The transplantation of stochastics, as an international scientific term, to the modern vocabulary 
is due to Jacob Bernoulli, evidently aware of the Greek language and literature, and in particular 
of the passage from Plato’s Philebus mentioned above. In his famous book Ars Conjectandi 
(written in Latin in 1684-89 but published after his death, in 1713) he writes: “To conjecture about 
something is to measure its probability. Therefore we define the art of conjecture, or stochastics, as 
the art of measuring the probabilities of things as exactly as possible, to the end that, in our 
judgments and actions, we may always choose or follow that which has been found to be better, more 
satisfactory, safer, or more carefully considered. On this alone turns all the wisdom of the philosopher 
and all the practical judgment of the statesman.”⁶

The term was revived by Bortkiewicz (1917; Russian economist and statistician of Polish
ancestry) and also by Slutsky (1925, 1928a,b, 1929; Ukrainian/Russian/Soviet mathematical
statistician and economist). It appears that the prevalence in the USSR of the more sophisticated term
stochastic (over the rather equivalent term random) must have been related to political and
ideological reasons (incongruence of randomness with dialectical materialism: models beyond
strictly deterministic ones were regarded with a priori suspicion; see Mazliak 2018).
But it was Kolmogorov (1931) who made the term popular and widespread, as he introduced
the term stochastic process also clarifying that process means change of a certain system. 
Additionally, he used the term stationary to describe a probability density function that is 
unchanged in time (while at the same time the system state changes). Soon after, Kolmogorov 
(1933) introduced the modern and consistent definition of probability in an axiomatic manner, 
based on the measure theory (see section 2.1). 


¹ «μεσότης τις ἄρα ἐστὶν ἡ ἀρετή, στοχαστική γε οὖσα τοῦ μέσου» (Aristot. Nic. Eth. 1106b, translation into
English adapted from that by H. Rackham. Cambridge, MA, Harvard University Press; London, William
Heinemann Ltd. 1934). The notion of ‘mesotes’ (μεσότης), loosely translated as balance, middle, mean
between a respective ‘too much’ and ‘too little’, is a key concept in Aristotle’s ethical philosophy and thus
to hit it as a target is important for him.


² «οἱ ἄνθρωποι πρὸς τὸ ἀληθὲς πεφύκασιν ἱκανῶς καὶ τὰ πλείω τυγχάνουσι τῆς ἀληθείας· διὸ πρὸς τὰ ἔνδοξα
στοχαστικῶς ἔχειν τοῦ ὁμοίως ἔχοντος καὶ πρὸς τὴν ἀλήθειάν ἐστιν» (Aristot. Rh. 1.1, translation into
English by W. Rhys Roberts, http://classics.mit.edu/Aristotle/rhetoric.1.i.html).


³ The complete passage is: ΣΩΚΡΑΤΗΣ: «οἷον πασῶν που τεχνῶν ἄν τις ἀριθμητικὴν χωρίζῃ καὶ μετρητικὴν
καὶ στατικήν, ὡς ἔπος εἰπεῖν φαῦλον τὸ καταλειπόμενον ἑκάστης ἂν γίγνοιτο. [...] τὸ γοῦν μετὰ ταῦτ’ εἰκάζειν
λείποιτ’ ἂν καὶ τὰς αἰσθήσεις καταμελετᾶν ἐμπειρίᾳ καί τινι τριβῇ, ταῖς τῆς στοχαστικῆς προσχρωμένους
δυνάμεσιν ἃς πολλοὶ τέχνας ἐπονομάζουσι, μελέτῃ καὶ πόνῳ τὴν ῥώμην ἀπειργασμένας. [...] οὐκοῦν μεστὴ
μέν που μουσικὴ πρῶτον, τὸ σύμφωνον ἁρμόττουσα οὐ μέτρῳ ἀλλὰ μελέτης στοχασμῷ, καὶ σύμπασα αὐτῆς
αὐλητική, τὸ μέτρον ἑκάστης χορδῆς τῷ στοχάζεσθαι φερομένης θηρεύουσα, ὥστε πολὺ μεμειγμένον ἔχειν
τὸ μὴ σαφές, σμικρὸν δὲ τὸ βέβαιον.»


(SOCRATES: “For example, if arithmetic and the sciences of measurement and weighing were taken away from 
all arts, what was left of any of them would be, so to speak, pretty worthless. [...] All that would be left for us 
would be to conjecture and to drill the perceptions by practice and experience, with the additional use of the 
powers of guessing, which are commonly called arts and acquire their efficacy by practice and toil. [...] Take 
music first; it is full of this; it attains harmony by guesswork based on practice, not by measurement; and flute 
music throughout tries to find the pitch of each note as it is produced by guess, so that the amount of 
uncertainty mixed up in it is great, and the amount of certainty small”; Plat. Phileb. 55e, translation by Harold
N. Fowler; Cambridge, MA, Harvard University Press; 1925.)

⁴ «πόλεως κτιζομένης προκείσθω τοῖς οἰκήσουσιν αὐτὴν ἐπίστασθαι βούλεσθαι, μὴ στοχαστικῶς ἀλλ’
ἀκριβῶς, ἐφ’ ἑκάστης ἡμέρας, ὁπόσον τε παρελήλυθεν ἤδη τοῦ χρόνου τοῦ κατ’ αὐτήν, ὁπόσον θ’ ὑπόλοιπόν
ἐστιν ἄχρι δύσεως ἡλίου.» (Γαληνοῦ Περὶ Διαγνώσεως καὶ Θεραπείας τῶν ἐν τῇ Ἑκάστου Ψυχῇ
Ἁμαρτημάτων, De Dignotione et Curatione cujusque Animi Peccatorum, 80,
http://www.poesialatina.it/_ns/greek/testi/Galenus/De_animi_cuiuslibet_peccatorum_dignotione_et_curatione.html).

⁵ «Οὐκοῦν Προφήτης μέν ἐστιν, ὁ κατὰ ἀποκάλυψιν τοῦ Πνεύματος προαγορεύων τὸ μέλλον· στοχαστὴς δέ,
ὁ διὰ σύνεσιν ἐκ τῆς τοῦ ὁμοίου παραθέσεως, διὰ τὴν πεῖραν τῶν προλαβόντων, τὸ μέλλον
συντεκμαιρόμενος.» (Basilius, Ἑρμηνεία εἰς τὸν Προφήτην Ἠσαΐαν, Enarratio in prophetam Isaiam,
3.102.26).


6 “Conjicere rem aliquam est metiri illius probabilitatem: ideoque Ars Conjectandi sive Stochastice nobis 
definitur ars metiendi quam fieri potest exactissimé probabilitates rerum, eo fine, ut in judiciis & actionibus 
nostris semper eligere vel sequi possimus id, quod melius, satius, tutius aut consultius fuerit deprehensum; in 
quo solo omnis Philosophi sapientia & Politici prudentia versatur” (Bernoulli, 1713).


1.2 Introductory notes on the examples 


The examples that follow make use of some of the longest available records of
hydroclimatic observations. Only long records reveal the secrets of hydroclimate and its 
behaviours, which seem peculiar as they are very different from our perception of 
“random events”. As we will see with the help of the examples: 


1. While classical probability and statistics adhere to the assumption that different 
events are independent, this assumption is totally inappropriate when dealing 
with hydroclimatic processes—and most other natural and artificial processes. An
illustration of the difference is provided in Digression 1.B. 

2. Popular “modern” approaches, such as those discovering “nonstationarity” are 
even more inappropriate. Models of this type identify mostly linear trends 
everywhere, trying to reconcile, in an inappropriate and inefficient manner, the 
disagreement between natural behaviours and those resulting from the 
independence assumption. 

3. Less popular approaches assuming dependence of events in time, in particular the 
type of dependence known as long-range dependence or persistence, can provide 
consistent quantification of extremes and the uncertainty thereof, which turns out 
to be much higher than captured by the other two alternatives. 


One may think that an approach leading to high uncertainty is unsuccessful and, in this 
respect, models of type 2 are advantageous. Indeed, such approaches have been promoted 
as physically based and consistent with the popular anthropogenic global warming 
literature and with the industry of climate models and their predictions (or projections). 
If climate model information was really incorporated in the stochastic model and if this 
information was consistent with reality, then, indeed, the resulting nonstationary model, 
in which the trend was derived by a deterministic model, would be progress. However,
climate model outputs in their original form (without cosmetic reformations known as
“bias correction” or “downscaling”) are irrelevant to reality (Koutsoyiannis et al., 2008;
Anagnostopoulos et al., 2010) particularly if we focus on extremes (Tsaknias et al., 2016).
Attempts to incorporate climate model information within a stochastic framework in a
consistent manner (Tyralis and Koutsoyiannis, 2017) lead to increased uncertainty or, in
the best case, to indifferent results. For these reasons, we will not consider approaches
based on climate model outputs in this book. 

A necessary note about the examples which follow is that they do not refer to the details 
of the marginal distribution of extremes. Certainly, this is quite an important issue that 
will be studied in subsequent chapters—and of course there is a large body of publications 
that study it. However, it is equally important to study the variation of the occurrence of 
extremes in time, which actually is the focus of the examples. This problem, which 
severely influences modelling of hydroclimatic risk and decisions related to it, has not 
been given the deserved attention in the literature, or has been dealt with using naive 
methods. 


Digression 1.B: Practical difference of dependence and independence


We assume that, based on observational data of river discharge, we have concluded that the 
probability of the event that the mean daily discharge at a certain location of a river exceeds 500
m³/s is small, equal to 10⁻³. Practically, this means that this event happens on the average once
every 1000 days or once every 2.74 years. What is the probability that this event occurs for five
consecutive days? 

Even though we have not yet defined what independence formally is (this will be done in
sections 2.5 and 3.5 and Digression 3.B), we intuitively know that the probability of independent
events occurring all together equals the product of the probabilities of the separate events. Thus,
under independence, the probability sought is simply (10⁻³)⁵ = 10⁻¹⁵. This is an extremely low
probability: it means that we have to wait on the average 10¹⁵ days or 2.74 trillion years, or about
200 times the age of the universe, to see this event happen. However, such events (successive
occurrences of extreme events for multiday periods) have been observed in several historical
samples. This indicates that the independence assumption is not a justified assumption and yields
erroneous results. Thus, we should avoid such an assumption if our target is to estimate
probabilities for periods longer than the reference period. Methodologies admitting dependence,
i.e., based on the theory of stochastic processes, are more appropriate for such problems and will
result in probabilities much greater than 10⁻¹⁵; these will be described in the next chapters.

Now let us assume that for four successive days our extreme event has already occurred, i.e., 
that the mean daily river discharge was higher than 500 m³/s in all four days. What is the
probability that this event will also occur in the fifth day?

Many people, based on an unrefined intuition, may answer that the occurrence of the event
already for four days will decrease the probability of another consecutive occurrence, and would
be inclined to give an answer in between 10⁻³ and 10⁻¹⁵. This is totally incorrect. If we assumed
independence, then the correct answer would be exactly 10⁻³; the past does not influence the
future. If we assume positive dependence, which is a more correct assumption for natural
processes, then the probability sought becomes higher, not lower, than 10⁻³; it becomes more
likely that a flood day will be followed by another flood day. 
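
The arithmetic of this digression can be checked with a few lines of code. The sketch below reproduces the numbers obtained under independence and, purely for illustration, contrasts them with a simple two-state Markov chain in which an exceedance day is followed by another one with probability q. The value q = 0.5, the function name mean_wait_years and the reading of 1/p as a mean waiting time are assumptions made here for the sake of the example; they are not estimates from any record, nor the models developed in the next chapters.

```python
# Back-of-the-envelope check of the numbers in this digression.
# The persistence value q is hypothetical; it is not estimated from any data.

P_EXCEED = 1e-3          # probability that the daily mean discharge exceeds 500 m3/s
DAYS_PER_YEAR = 365.25


def mean_wait_years(probability):
    """Rough mean waiting time (years) for an event with the given daily probability."""
    return 1.0 / probability / DAYS_PER_YEAR


# Independence: the probabilities of the five daily events multiply.
p_five_independent = P_EXCEED ** 5

# Positive dependence, sketched as a two-state Markov chain (an assumption):
# q = P(exceedance today | exceedance yesterday), set arbitrarily to 0.5.
q = 0.5
p_five_dependent = P_EXCEED * q ** 4

print(f"one day:             p = {P_EXCEED:.2e}, "
      f"wait about {mean_wait_years(P_EXCEED):.2f} years")
print(f"five days, indep.:   p = {p_five_independent:.2e}, "
      f"wait about {mean_wait_years(p_five_independent):.2e} years")
print(f"five days, q = {q}:  p = {p_five_dependent:.2e}, "
      f"wait about {mean_wait_years(p_five_dependent):.1f} years")
```

In this toy chain the probability of a fifth exceedance day, given four in a row, is simply q, i.e., hundreds of times higher than the value 10⁻³ that holds under independence, which is the qualitative point made above.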

As we will see in the next chapters, similar things happen if we move from the daily scale of the
above example to the annual scale, or even larger. For example, if several warm winters have
occurred in a row, then the probability that the next winter would be also warm is increased,
not decreased. Ignorance of this simple truth may have severe consequences for those who aspire
to predict the future and those who believe their predictions. A didactic historical example is the 
failed prediction of Hitler’s meteorologist Franz Baur about the 1941-42 winter in Russia, which 
marked the Battle of Moscow. Quoting a fascinating paper by Neumann and Flohn (1987): 


Baur was requested by the headquarters of the German Air Force to distribute his long-range 
forecasts to about 25 military offices. A forecast for winter 1941-42 was issued by him, probably 
at the end of October 1941, based on regional climatology and (supposed) sun-spot-climate 
relationships. The prediction called for a normal or a mild winter. Baur's main justification for 
this rested with the assertion that never in climatic history did more than two severe winters 
occur in a row. Since both of the preceding two winters, 1939-40 and 1940-41, were severe in 
Europe, he did not expect that the forthcoming winter would also be severe. 


However, that winter, in which the first major Soviet counteroffensive of the war was launched, 
turned out to be one of the coldest on record:


The cold outbreak of early December, coming after a cool to-cold October and November [...] 
gravely hit the German armies that were not appropriately clothed (Hitler expected to break the 
resistance of the USSR before the coming of winter) and which were not equipped with 
armaments, tanks, and motorized vehicles that could properly function even in a "normal" winter 
in the northern parts of the USSR, let alone in a winter as rigorous as that of 1941-42. 

On or about 8 December, K. Diesing, chief of the CWG and scientific adviser to the chief of the 
Weather Service of the Air Force (General Spang), asked Flohn to listen in on a second earphone 
to a telephone call to Baur. In the call, Diesing cited to Baur the reports of very low temperatures 
in the East and asked him if he maintains his seasonal forecast in face of the reports. Baur's 
response was “the observations must be wrong”. 


1.3 Precipitation and its extremes as seen in a long record 


Extreme behaviour in precipitation causes floods and droughts and therefore its study is 
very useful. Notably, even when flow records exist, rainfall probability still has a major
role in engineering design; for instance, in major hydraulic structures, the design floods 
are generally estimated from appropriately synthesized design storms, which are rare 
extreme-rainfall events (e.g. U.S. Department of the Interior, 1987). 




Therefore, we start our exploration of extremes with precipitation. In our example we 
study one of the longest daily rainfall records worldwide, that of Bologna, Italy (44.50°N, 
11.35°E, 53.0 m). The time series of observations is available online in the frame of the 
Global Historical Climatology Network - Daily (GHCN-Daily; Menne et al., 2012)*. It is 
uninterrupted for the period 1813-2007, 195 years in total. For the most recent period, 
2008-2018, daily data are provided by the online data repository Dext3r of ARPA Emilia 
Romagna, Rete di monitoraggio RIRER.† With these additional data, the record length
becomes 206 years. The analyses that follow are based on the GHCN 195-year data set, 
while the most recent 11-year data are used for validation purposes. 

Figure 1.1 depicts the daily time series as well as the (right-aligned) moving averages and
moving maxima for a time window of 10 years, representing the 10-year climatic values
(for clarification of the meaning of climatic in our context see Digression 1.C). The most
spectacular behaviour shown in the figure is the changing climate: the 10-year climatic
average daily rainfall has been changing between a minimum of 1.2 mm (having occurred
in the 1820s) and a maximum of 2.5 mm (having occurred in the decade ending in 1902),
more than twice the minimum. At the same time the 10-year climatic value of the
maximum daily rainfall has varied between a minimum of 48.5 mm (having occurred in
the 1820s) and a maximum of 155.7 mm (having occurred in the 1930s), more than
three times the minimum. These changes do not follow a linear pattern but have the form 
of long-term non-periodic fluctuations, up and down. In the most recent years, after 1950, 
there is a roughly increasing trend in both climatic indices, but such increasing trends 
were also observed before 1900, followed by drops thereafter. 
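
For readers who wish to reproduce this kind of plot, a right-aligned moving statistic can be sketched as follows. This is a minimal illustration in plain Python, not the code used for Figure 1.1; the function name right_aligned_window and the toy values are hypothetical, and for the Bologna record the window would span the number of days in 10 years rather than 3 values.

```python
def right_aligned_window(values, window, reducer):
    """Apply `reducer` to the `window` most recent values ending at each position.

    Positions without enough history are returned as None.
    """
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # fewer than `window` values available so far
        else:
            result.append(reducer(values[i + 1 - window: i + 1]))
    return result


def mean(values):
    return sum(values) / len(values)


# Toy daily rainfall depths in mm (hypothetical values, not the Bologna data).
toy_rainfall = [0.0, 3.0, 0.0, 12.0, 0.0, 6.0]

moving_average = right_aligned_window(toy_rainfall, 3, mean)
moving_maximum = right_aligned_window(toy_rainfall, 3, max)

print(moving_average)
print(moving_maximum)
```

Right alignment means that the value reported at a given time uses only that time and the preceding ones, so that no future information enters the climatic value.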

A popular approach to deal with such changing patterns is to assume linear trends; 
publications adopting this approach abound (see Iliopoulou and Koutsoyiannis, 2020). A 
linear trend is presumably a deterministic model (even though we use the data to fit it), 
as it describes the change of the mean of the process as a deterministic linear function of 
time. Here it is pertinent to recall the good practice of fitting deterministic models to data, 
which is typical for hydrological modelling, albeit commonly overlooked in fitting trend 
models. This practice is the so-called split-sample testing, in which the available record is
split into two segments, one of which is used for calibration and the other for validation,
as emphatically suggested by Klemeš (1986).

We have applied the split-sample technique to the annual values of some indices 
extracted from the Bologna rainfall record. These are: 


• the annual total precipitation, i.e., the sum of daily precipitation from all (wet) days
of each year;

• the annual maximum daily precipitation, i.e., the greatest of all daily rainfall depths
over the (wet) days of a specific year;

• the probability dry, i.e., the ratio of the days with zero precipitation to the total
number of a year’s days (365 or 366); and

• the annual average wet-day precipitation, i.e., the ratio of the annual total
precipitation to the number of wet days.


* GHCN Version 3; data retrieved on 2019-02-17 from https://climexp.knmi.nl/gdcnprcp.cgi?WMO=ITE00100550.


† Data retrieved on 2019-02-17 from http://www.smr.arpa.emr.it/dext3r/. In particular, the data from the
station Bologna Idrografico (coordinates 44.499883°N, 11.346156°E, 84.0 m, practically the same as those
given for the GHCN station, except a 31 m difference in the elevation, perhaps indicating that this particular
station is located at the roof of a building) were used except for year 2008 for which no data are provided
for this station. For this year, as well as for very few days with missing values in other years, the daily
precipitation values of the station Bologna Urbana (44.500754°N, 11.328789°E, 78.0 m) were used instead.


The annual maximum daily rainfall is related to the generation of floods. At the other end 
of extremes, as the annual minimum daily rainfall does not vary but it is always zero, an 
index of extreme behaviour is the probability dry, related to occurrence of droughts. 
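
The four indices defined above are straightforward to compute from the daily values of a single year. The following sketch assumes that a wet day is any day with a depth greater than zero, in line with the definition of the probability dry given above; the function name annual_indices, the dictionary keys and the ten-day toy "year" are hypothetical and serve only to make the arithmetic easy to follow.

```python
def annual_indices(daily_mm):
    """Compute the four annual indices from the daily depths (mm) of one year."""
    n_days = len(daily_mm)
    wet_days = [depth for depth in daily_mm if depth > 0]  # wet day: depth > 0 (assumed)
    total = sum(wet_days)
    return {
        "annual_total": total,
        "annual_max_daily": max(wet_days) if wet_days else 0.0,
        "probability_dry": (n_days - len(wet_days)) / n_days,
        # Undefined when the year has no wet day at all
        "avg_wet_day": total / len(wet_days) if wet_days else None,
    }


# A hypothetical ten-day "year", used only to keep the numbers readable.
toy_year = [0.0, 0.0, 10.0, 0.0, 5.0, 0.0, 0.0, 25.0, 0.0, 0.0]
indices = annual_indices(toy_year)
print(indices)
```

Applied to each calendar year of a daily record, such a function yields the four annual time series examined below.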
Figure 1.1 Plot of the time series of daily rainfall in Bologna, along with moving averages and
moving maxima for a time window of 10 years (right-aligned, i.e., the value plotted at a specific 
year is the average or the maximum of the previous 10 years). The lines in darker colour represent 
the GHCN time series while those in lighter colour represent the newer data which are not 
included in the GHCN time series. 


Plots of these annual indices are shown in Figure 1.2 along with fitted trends. Using the 
split-sample technique, we fitted the linear model on the mean of each index on the most 
recent part of the GHCN time series, namely the period 1950-2007. The linear trend model 
is 

$$\mu(t) = a + b t \tag{1.1}$$

where μ is the mean of each process (index as a function of time), t is time and a and b are
the parameters fitted by the standard linear regression method. As the simplest possible
alternative, the constant mean model was used, i.e.,

$$\mu(t) = a = \text{constant} \tag{1.2}$$

(not shown in the graph).
Figure 1.2 Plots of annual indices related to the (daily) rainfall process, namely annual total
precipitation, annual maximum daily precipitation, probability dry and annual average wet day 
precipitation, with trends fitted on the most recent part of the GHCN time series, namely the 
period 1950-2007, for which the graphs are plotted with thicker lines. For the plots of the bottom 
row, namely the probability dry and the annual average wet day precipitation, trends are also 
plotted for the earliest 25-year period, 1813-1837. The newer data that are not included in the 
GHCN time series are plotted with dotted lines. 


Two validation periods were used, namely the earlier period 1813-1949 of the GHCN 
time series, and the next period with the newer data of 2008-18, not contained in the 
GHCN time series. The comparison of the two models for each of the two validation 
periods is made in Table 1.1 in terms of the root mean square error in each case, defined 
as

$$E_{\mathrm{RMS}} = \sqrt{\frac{1}{n} \sum_{\tau=1}^{n} (x_\tau - \mu_\tau)^2} \tag{1.3}$$

where $x_\tau$ denotes the τth item of the observed time series and $\mu_\tau = \mu(\tau)$. Clearly, the
comparison shows that the simpler, constant-mean model outperforms the linear model 
in all cases and in both validation periods. The inferior performance of the linear model is 
also seen visually in Figure 1.2. Therefore, we have no good reason to choose the linear- 
trend model. 
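
The split-sample procedure itself is easy to sketch. The code below fits the linear trend model and the constant mean model on a calibration period and evaluates the root mean square error of each on an earlier validation period. It runs on a synthetic series of independent Gaussian values generated only so that the example is self-contained; it does not use the Bologna data and does not reproduce the values of Table 1.1. The function names fit_linear and rmse, the seed and the parameters of the synthetic series are all assumptions of this sketch.

```python
import math
import random


def fit_linear(t, x):
    """Ordinary least squares fit of x = a + b * t; returns (a, b)."""
    n = len(t)
    t_mean = sum(t) / n
    x_mean = sum(x) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_tx = sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, x))
    b = s_tx / s_tt
    a = x_mean - b * t_mean
    return a, b


def rmse(observed, modelled):
    """Root mean square error between observed and modelled values."""
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, modelled)) / n)


# Synthetic annual series (hypothetical; NOT the Bologna record).
random.seed(0)
years = list(range(1813, 2008))
series = [700.0 + random.gauss(0.0, 150.0) for _ in years]

# Calibration on 1950-2007, validation on 1813-1949.
t_cal = [y for y in years if y >= 1950]
x_cal = [x for y, x in zip(years, series) if y >= 1950]
t_val = [y for y in years if y < 1950]
x_val = [x for y, x in zip(years, series) if y < 1950]

a, b = fit_linear(t_cal, x_cal)
constant_mean = sum(x_cal) / len(x_cal)

rmse_linear = rmse(x_val, [a + b * t for t in t_val])
rmse_constant = rmse(x_val, [constant_mean] * len(x_val))

print(f"RMSE, linear trend:  {rmse_linear:.1f}")
print(f"RMSE, constant mean: {rmse_constant:.1f}")
```

With a real record, the same two numbers for each index and each validation period are what a table such as Table 1.1 reports.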


Table 1.1 Root mean square error for the two validation periods and the two models, linear trend
and constant mean, fitted to the calibration period (1950-2007).

                          Annual total     Annual maximum   Probability   Annual average
                          precipitation    daily precip.    dry (-)       wet-day precip.
                          (mm)             (mm)                           (mm)
Validation period 1813-1949
Assuming linear trend     206.9            36.8             0.194         6.12
Assuming constant mean    204.0            21.8             0.076         2.38
Validation period 2008-2018
Assuming linear trend     138.3            16.3             0.064         1.54
Assuming constant mean    127.7            8.7              0.053         0.85

Actually, there are additional reasons not to choose the linear-trend model, even if its
performance were good. These are related to the poor logical background (or complete
lack thereof) in using time per se as an explanatory variable in a natural process, as well
as to fundamental concepts of stochastics, namely stationarity and ergodicity, which
despite being fundamental are widely misunderstood. These concepts will be discussed
in Chapter 3, while the reasons for excluding such models (including the exceptions in 
which such models are valid) are discussed elsewhere (Koutsoyiannis, 2011a; Montanari 
and Koutsoyiannis, 2014; Koutsoyiannis and Montanari, 2015). And even assuming that 
there were no theoretical obstacles and inferior performance, again we might adopt the 
constant mean model because of its parsimony (Iliopoulou and Koutsoyiannis, 2020). 
Specifically, philosophical reasoning (principle of parsimony, also known as Occam’s 
razor) and practical considerations (model uncertainty) suggest preferring the more 
parsimonious model (Gauch, 2003). Quantification of such model comparisons,
which is not given here, is routinely done within stochastics; cf. the Akaike (1973, 1974)
criterion and the Bayesian information criterion (Schwarz, 1978), as well as Serinaldi and
Kilsby (2015), Serinaldi et al. (2018), and Iliopoulou and Koutsoyiannis (2020).

But even without these theoretical reasons, one can easily understand the absurdity of 
the linear-trend model by examining Figure 1.2. For example, if we assumed that the 
record of measurements did not reach that far back in time and we had adopted the linear-
trend model for the annual maximum daily precipitation, we would conclude that around
1800 there was no intense rainfall at all, and that in the 18th century the precipitation was
negative.

To further this example, let us make a thought experiment and assume that in the
beginning of the 19th century there lived in the area three scientists, Drs A, B and C. Dr A
kept records of the dry days of each year and Dr B observed the storm severity. In the
1830s, Dr A cast the prediction that rainfall will totally cease by 1850. In contrast Dr B
said that storms become more severe and by 1850 the storm activity will be tripled at
least. Then came Dr C who reconciled the two theories stating that dry becomes drier and
wet becomes wetter, and that the storms are much more severe while the regular rainfall
events are becoming more and more rare, and will soon disappear. Now if we look again
at Figure 1.2, in particular the bottom panels where trends are also plotted for the 25-year
period 1813-1837, we will understand that these claims would stand, if we were ready to
accept the trend model as a decent one. Fortunately, however, scientists of our modern
epoch do not use such naive approaches to make groundless predictions, particularly of
catastrophic or even apocalyptic events.*

Now having rejected the widespread practice of fitting linear trends, the question is,
can we think of a better alternative? Apparently, the answer is positive within stochastics.
Otherwise, it would be a big failure, because the behaviour seen in our rainfall example is 
neither a peculiarity of rainfall, nor one of Bologna. It is quite common everywhere, even 
though we often do not see it for at least two reasons: (a) we do not have long enough 
records and (b) we are misled by the fact that we learnt probability by examples such as 
idealized (not even real) dice and roulettes. 

In an idealized die the probabilities of the six possible outcomes are always the same, 
irrespective of the results of previous throws. Macroscopically this simple system 
undergoes no change at all. That is, if we take the moving average of very many outcomes, 
we will have a flat line. In real-world processes the situation is different. There is change 
all the time and over all scales. Also, all events depend on each other. Dependence and 
change are closely related. We will see this relation later on (Digression 3.B). For now, we 
may take a note that dependence should not be interpreted as memory, as typically seen 
in literature, but as change. In particular, long-range dependence is not long-term 
memory but long-term change. 

How is change quantified in stochastics? A simple way would be to describe some of 
the statistical characteristics as deterministic functions of time, but this is neither so 
effective nor rational, as we have seen in this Bologna rainfall example. Another option is 
to make this quantification in a stochastic, rather than deterministic, manner. In this case 
we view the change as variability across different time scales. In turn, the variability is 
quantified in terms of the variance. 

Referring again to the annual time series of rainfall indices of Bologna for the entire 
206-year period, which we denote as x_1, x_2, ..., x_206, we take the following steps: 


• We calculate the variance estimate γ̂(1), where ‘1’ indicates the time scale of 1 
year, as: 

\[ \hat{\gamma}(1) = \frac{(x_1 - \hat{\mu})^2 + (x_2 - \hat{\mu})^2 + \cdots + (x_n - \hat{\mu})^2}{n - 1} \qquad (1.4) \]





* By the way, by examining the frequency of word usage in books with the help of Google’s Ngram Viewer 
(https://books.google.com/ngrams/), we see that the word ‘catastrophic’ was practically not used in the 
19th century but its frequency per million of words increased linearly in the 20th century to reach a recent 
peak value of 3.6. The usage of the word ‘apocalyptic’ peaked in 1808 with a frequency of 2.8 and again 
more recently in 1995 with a frequency of 3.6 (Koutsoyiannis, 2013b). 

where μ̂ is the estimate of the mean and n = 206 is the sample size. The notion of 
estimate will be clarified later on, in Chapter 4. 

• We form a time series at time scale 2 (years) by averaging pairs of consecutive 
items of the time series, i.e.: 

\[ x^{(2)}_1 = \frac{x_1 + x_2}{2}, \quad x^{(2)}_2 = \frac{x_3 + x_4}{2}, \quad \ldots, \quad x^{(2)}_{103} = \frac{x_{205} + x_{206}}{2} \qquad (1.5) \]

and we calculate the estimate of the variance γ̂(2) in a similar manner. 


• We repeat the same procedure to form time series at time scales 3, 4, ..., up to scale 
20 (1/10 of the record length) and calculate the variances γ̂(3), γ̂(4), ..., γ̂(20). 

• We plot (in double logarithmic axes) the variance γ̂(κ) as a function of time scale κ. 
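
The steps above can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `empirical_climacogram` is hypothetical, the n - 1 denominator (`ddof=1`) is an assumption, the mean is estimated from the averaged series itself, and incomplete blocks at the end of the record are simply dropped.

```python
import numpy as np

def empirical_climacogram(x, max_scale=None):
    """Variance of the time-averaged series as a function of time scale.

    x         : 1-D sequence of values at the basic time scale (e.g. annual).
    max_scale : largest averaging scale; defaults to 1/10 of the record length.
    Returns (scales, variances).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if max_scale is None:
        max_scale = max(1, n // 10)
    scales = np.arange(1, max_scale + 1)
    variances = np.empty(scales.size)
    for j, k in enumerate(scales):
        m = n // k                                        # number of complete blocks of length k
        averaged = x[: m * k].reshape(m, k).mean(axis=1)  # series at time scale k
        variances[j] = averaged.var(ddof=1)               # assumed n - 1 denominator
    return scales, variances

# Synthetic stand-in for a 206-year annual record (independent random values).
rng = np.random.default_rng(0)
scales, variances = empirical_climacogram(rng.standard_normal(206))
```

Plotting `variances` against `scales` on double logarithmic axes (for instance with matplotlib's `loglog`) gives the graph described in the last step.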


The function of the variance vs. time scale is called the climacogram* (Koutsoyiannis, 
2010). If we have assumed a model for our process and we determine the variance, γ(κ), 
from the model, we have the theoretical climacogram. If we estimate the variance, γ̂(κ), 
from a time series, then we have the empirical climacogram. Notably, if we have produced 
the time series from a model, the empirical climacogram will not necessarily coincide 
with the theoretical, because there is estimation bias. To make them coincide, we must 
subtract the bias from the theoretical climacogram. This is not difficult because, once we 
know the model, the bias is readily determined from that model by a simple and explicit 
relationship (see section 4.6). 

Now, if the time series x_τ represented the so-called white noise, i.e., a pure random 
process, in which all events are independent of each other, the double logarithmic plot of 
the climacogram would be a straight line with slope -1; the proof is straightforward (see 
equation (3.49)). In real-world processes, the slope is different from -1, designated as 
2H - 2, where H is the so-called Hurst parameter, which takes on values in the interval (0, 
1). We will see later on (section 3.7) that H is identical to the entropy production in 
logarithmic time. The case where this slope is constant for all time scales corresponds to 
a simple scaling behaviour (e.g. Koutsoyiannis, 2006b), or the power law: 


\[ \gamma(\kappa) \propto \kappa^{2H - 2} \qquad (1.6) \]


which defines the Hurst-Kolmogorov (HK) process, a name giving credit to Hurst (1951), 
who was the first to discover this behaviour in natural processes, and to Kolmogorov 
(1940) who was the first to introduce the process as a mathematical object. 


* The term climacogram, from the Greek κλιμακόγραμμα, deriving from κλίμαξ (climax = scale, as well as 
ladder; pl. κλίμακες) and γράμμα (gramma = written, drawn), was coined in Koutsoyiannis (2010) and could 
be translated in English as scale(o)gram, but the latter term is used for another concept. Climacogram should 
not be confused with climatogram, which has another meaning related to climate and, specifically, the climatic 
regime of temperature and precipitation at a site or area. The term κλίμα (climate) was first used in the 
Hellenistic period by Hipparchus (see Digression 1.C), in relationship to the slope of the sun's rays, and is 
different from the other derivative noun κλίμαξ. Interestingly though, both κλίμαξ and κλίμα are eventually 
etymologized from the same verb κλίνειν (klinein = to slope). 
It is easily seen that the value H = 1/2 corresponds to white noise as the slope is -1. 
High values of H (> 1/2) indicate enhanced change at large scales, else known as long-term 
persistence, or strong clustering (grouping) of similar values. This is quite common in 
natural processes. Low values of H (< 1/2) indicate quite a different behaviour, called 
antipersistence. This is often confused with a periodic behaviour and hence called 
quasi-periodic (because the period of fluctuations is not constant). Such behaviour is much less 
frequent in hydroclimatic processes. 
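
The relation between the slope 2H - 2 and the Hurst parameter can be illustrated with a short Python sketch. It is a rough illustration only: the helper names are hypothetical, the series is synthetic white noise, and the plain least-squares fit on the log-log climacogram ignores the estimation bias mentioned above, so it should not be read as the estimation method used for the figures of this chapter.

```python
import numpy as np

def climacogram(x, max_scale):
    """Sample variance (n - 1 denominator assumed) of the k-averaged series, k = 1..max_scale."""
    x = np.asarray(x, dtype=float)
    scales = np.arange(1, max_scale + 1)
    variances = np.array([
        x[: (x.size // k) * k].reshape(-1, k).mean(axis=1).var(ddof=1)
        for k in scales
    ])
    return scales, variances

def hurst_from_slope(scales, variances):
    """Fit a straight line to log(variance) vs. log(scale); slope = 2H - 2, so H = 1 + slope / 2."""
    slope = np.polyfit(np.log(scales), np.log(variances), deg=1)[0]
    return 1.0 + slope / 2.0

# Synthetic white noise: independent values, so H should come out close to 1/2.
rng = np.random.default_rng(42)
white = rng.standard_normal(20000)
scales, variances = climacogram(white, max_scale=20)
h_white = hurst_from_slope(scales, variances)
```

For a strongly persistent series the same naive fit would tend to underestimate H, because the variance estimates at large scales are biased downwards.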

Figure 1.3 Empirical and theoretical climacograms of annual indices of daily rainfall at Bologna: 
(left) probability dry; (right) annual average wet-day precipitation. 

Now we apply this method to the annual indices of daily Bologna rainfall. Figure 1.3 
depicts the climacograms of the probability dry and the annual average wet-day 
precipitation. In both cases, the observed behaviour is spectacularly different from white 
noise while the Hurst-Kolmogorov behaviour is evident with Hurst parameter H as high 
as 0.95 for the probability dry and 0.90 for the wet-day precipitation. The situation is 
somewhat more complex for the annual total rainfall (not shown in Figure 1.3), in which 
the slope is different for small and large scales, an effect already known and analysed in 
Markonis and Koutsoyiannis (2016). The slope for large scales again suggests a strongly 
persistent behaviour with Hurst parameter H = 0.86. The annual maxima series tend to 
hide the Hurst behaviour, as explained in Iliopoulou and Koutsoyiannis (2019) and indeed 
the estimated H in this case is much smaller, ~0.60 (again not shown in Figure 1.3). 

The Bologna precipitation example, as well as those that follow and many others, help 
shape a classification of change shown in the hierarchical chart of Figure 1.4. In simple 
systems (left part of the graph) the change is regular, either periodic or aperiodic. Regular 
change in simple cases is predictable in deterministic terms, using equations of dynamical 
systems. But this type of change is rather trivial. More interesting are the more complex 
systems at long time horizons (right part of the graph), where change is unpredictable in 
deterministic terms, or random. Pure randomness, as in classical statistics, where 
different variables are identically distributed and independent, is a useful model for 
idealized dice experiments, but in most natural systems it is inadequate. A structured 
randomness, as in the HK process, should be assumed instead. The structured 
randomness is enhanced randomness, expressing enhanced unpredictability of enhanced 
multi-scale change. 









































Figure 1.4 Classification of change (from Koutsoyiannis 2013b). 


Digression 1.C: What is climate? 


As is the case with stochastics (Digression 1.A), the concept of climate is an old one. Aristotle in 
his Meteorologica describes the climates on Earth in connection with latitude but he uses a 
different term, crasis (κρᾶσις,¹ literally meaning mixing, blending of things which form a 
compound, temperament).² The term climate (κλίμα, plural κλίματα) was coined as a geographical 
term by the astronomer Hipparchus³ (190-120 BC). He was the founder of trigonometry but is 
most famous for his discovery and calculation of the precession of the equinoxes (μετάπτωσις 
ισημεριών) by studying measurements on several stars. In the 20th century, this precession would 
be found to be related to the climate of Earth and constitutes one of the so-called Milankovitch 
cycles. The term climate originates from the verb κλίνειν, meaning ‘to incline’ and originally 
denoted the angle of inclination of the celestial sphere and the terrestrial latitude characterized 
by this angle (Shcheglov, 2007). 

Hipparchus’s Table of Climates is described by Strabo the Geographer (63 BC - AD 24), from 
whom it becomes clear that the Climata of that Table are just latitudes of several cities, from 16° 
to 58°N (see Shcheglov, 2007, for a reconstruction of the Table). However, Strabo himself uses the 
term climate with a meaning close to the modern one.⁴ Furthermore, Strabo defined the five 
climatic zones, torrid, temperate and frigid, as we use them to date.⁵ 

The term climate was used with the ancient Greek geographical meaning until at least 1700 as 
imprinted in a dictionary of that era.⁶ In contemporary times, a search on old books⁷ reveals that 
the term climatology appears after 1800. With the increasing collection of meteorological 
measurements, the term climate acquires a statistical character as the average weather. Indeed, 
the geographer A.J. Herbertson (1907) in his book entitled “Outlines of Physiography, an 
Introduction to the Study of the Earth”, gave the following definition of climate, based on, but also 
distinguishing it from, weather: 

By climate we mean the average weather as ascertained by many years’ observations. Climate 
also takes into account the extreme weather experienced during that period. Climate is what on 
an average we may expect, weather is what we actually get.⁸ 


Herbertson also defined climatic regions of the world based on statistics of temperature and 
rainfall distribution, a work that was influential for the famous and most widely used Köppen 
(1918) climate classification; this includes six main zones and eleven climates which are on the 
same general scale as Herbertson's (Stamp, 1957). Herbertson's definition is kept virtually 
without essential changes till now; for example, Lamb (1972) states: 


Climate is the sum total of the weather experienced at a place in the course of the year and over 
the years. It comprises not only those conditions that can obviously be described as ‘near average’ 
or ‘normal’ but also the extremes and all the variations. 


Modern scientific glossaries also provide similar definitions of climate. We quote a few: 


• By the USA National Weather Service: 
Climate - The composite or generally prevailing weather conditions of a region, throughout the 
year, averaged over a series of years.⁹ 


• By the Climate Prediction Center of the latter: 
Climate - The average of weather over at least a 30-year period. Note that the climate taken over 
different periods of time (30 years, 1000 years) may be different. The old saying is climate is what 
we expect and weather is what we get.¹⁰ 


• By the American Meteorological Society¹¹: 

Climate - The slowly varying aspects of the atmosphere-hydrosphere-land surface system. It is 
typically characterized in terms of suitable averages of the climate system over periods of a month 
or more, taking into consideration the variability in time of these averaged quantities. Climatic 
classifications include the spatial variation of these time-averaged variables. Beginning with the 
view of local climate as little more than the annual course of long-term averages of surface 
temperature and precipitation, the concept of climate has broadened and evolved in recent 
decades in response to the increased understanding of the underlying processes that determine 
climate and its variability. 


In turn, the climate system is defined as: 

The system, consisting of the atmosphere, hydrosphere, lithosphere, and biosphere, determining 
the earth's climate as the result of mutual interactions and responses to external influences 
(forcing). Physical, chemical, and biological processes are involved in the interactions among the 
components of the climate system. 


• By the WMO (1992): 
C0850 climate - Synthesis of weather conditions in a given area, characterized by long-term 
statistics (mean values, variances, probabilities of extreme values, etc.) of the meteorological 
elements in that area. 
C0900 climate system - System consisting of the atmosphere, the hydrosphere (comprising the 
liquid water distributed on and beneath the Earth's surface, as well as the cryosphere, i.e. the 
snow and ice on and beneath the surface), the surface lithosphere (comprising the rock, soil and 
sediment of the Earth's surface), and the biosphere (comprising Earth's plant and animal life and 
man), which, under the effects of the solar radiation received by the Earth, determines the climate 
of the Earth. Although climate essentially relates to the varying states of the atmosphere only, the 
other parts of the climate system also have a significant role in forming climate, through their 
interactions with the atmosphere. 


• By the IPCC (2013b): 
Climate - Climate in a narrow sense is usually defined as the average weather, or more rigorously, 
as the statistical description in terms of the mean and variability of relevant quantities over a 
period of time ranging from months to thousands or millions of years. The classical period for 
averaging these variables is 30 years, as defined by the World Meteorological Organization. The 
relevant quantities are most often surface variables such as temperature, precipitation and wind. 
Climate in a wider sense is the state, including a statistical description, of the climate system. 


A useful observation is that all definitions use the term “average” (an exception is the definition 
by Lamb who uses the loose term sum total with the same meaning). Thus, by its definition, 
climate is a statistical concept. And since climate is not static but dynamic, it is better to think of 
it as a stochastic concept. 

By scrutinizing these definitions, several questions may arise. A first one might be: Why “at 
least a 30-year period”? Is there anything special about 30 years? Probably this reflects a 
historical belief that 30 years are enough to smooth out “random” weather components and 
establish a constant mean. In turn, this reflects a perception of a constant climate, and a hope 
that 30 years would be enough for a climatic quantity to stabilize at a constant value. It can 
be conjectured that the number 30 stems from the central limit theorem (see section 2.17) and in 
particular the common (but not quite right) belief that the sampling distribution of the mean is 
normal for sample sizes over 30 (e.g. Hoffman, 2015). Such a perception roughly harmonizes with 
classical statistics of independent events. This perception is further reflected in the term anomaly 
(from the Greek ανωμαλία, meaning abnormality), commonly used in climatology to express the 
difference from the mean. Thus, the dominant idea is that a constant climate would be the norm 
and a deviation from the norm would be an abnormality, perhaps caused by an external agent. 
However, such a belief is incorrect. The examples given in this chapter support the idea of an 
ever-changing climate. 

A second question, inspired by the Climate Prediction Center’s definition, is: Why is the climate taken 
over 30 or 1000 years different? The obvious reply is: Because different 30-year periods have 
different climate. This contradicts the tacit belief of constancy and harmonizes with the 
perception of an ever-changing climate. With the latter perception, Herbertson’s idea (which the 
Climate Prediction Center refers to as an “old saying”) that “climate is what we expect, weather is 
what we get” can be reformulated as “weather is what we get immediately, climate is what we get 
if you keep expecting for a long time” (Koutsoyiannis, 2011a). 

As many of the above definitions refer to weather, it is useful to clarify its meaning, noting that 
it represents a popular notion, often used with respect to its effects upon life and human activities, 
rather than a rigorously scientific one. Interestingly, in its colloquial use in Greek and Romance 
(Neo-Latin) languages, weather is almost indistinguishable from time (Greek: καιρός; Italian: 
tempo; French: temps, météo; Spanish: tiempo, clima; Portuguese: tempo, clima). On the other 
hand, in English and Greek, weather refers to short-scale variations in the atmosphere and is 
distinguished from climate; note however that in colloquial Spanish and Portuguese there is no 
such distinction. In scientific terms, the definition given by the WMO (1992) is this: 


W0410 weather - State of the atmosphere at a particular time, as defined by the various 
meteorological elements. 


Based on the above discussion, here we attempt to give a definition of climate, which is used in 
this book, in a hierarchical manner (avoiding circular logic) starting from the concept of climatic 
system, as follows: 


• Climatic system is the system consisting of the atmosphere, the hydrosphere (including its solid 
phase—the cryosphere), the lithosphere and the biosphere, which mutually interact and 
respond to external influences (system inputs) and particularly those determining the solar 
radiation reaching the Earth, such as the solar activity, the Earth’s motion and the volcanic 
activity. 


• Climatic processes are the physical, chemical and biological processes, which are produced by 
the interactions and responses of the climatic system components through flows of energy and 
mass, and chemical and biological reactions. 


• Climate is a collection of climatic processes at a specified area, stochastically characterized for 
a range of time scales. 

According to this latter definition, and given that the term process means change 
(Kolmogorov, 1931), climate is changing by definition. Thus, there is no need to define or use the 
term climate change; actually, this latter term, which appeared in literature only after the 1970s, 
serves non-scientific purposes (Koutsoyiannis, 2020c, 2021). Change occurs at all scales 
(Koutsoyiannis, 2013b), and there is nothing particular in any specific one, like the commonly 
assumed 30-year scale. By studying long observation series of atmospheric and hydrological 
processes, one would see that the only characteristic scale with clear physical meaning is the 
annual—beyond that there is no objective “border scale” that would support a different definition 
of climate. The above definition includes all scales beyond the annual, thus leaving out the smaller 
scales (e.g. of several minutes or days) to be associated with weather. 

The stochastic characterization, appearing in the definition of climate, includes all statistics 
used in other definitions, such as averages, variability, extremes, etc., and collectively 
encompasses all related concepts of the scientific areas of probability, statistics and stochastic 
processes (Koutsoyiannis, 2021). 

The main distinction between weather and climate is this. While weather, according to its 
definition by WMO (1992) which is kept unchanged here, refers to a particular time, climate refers 
to the entire climatic process, throughout all times. 

As stated in the WMO (1992) definition of climate quoted above, the typical use of the term 
climate relates to the atmosphere only, leaving out the other parts of the climatic system. 
However, since the climatic system includes the hydrosphere, there is no reason to exclude the 
hydrological processes from the climatic processes. Therefore, our definition includes them. 
Nevertheless, to give more emphasis on the inclusion of hydrological processes, the term 
hydroclimatic has been used even in the title of the book. This provides additional clarity or 
emphasis but it is rather a pleonasm as the hydrosphere is already included in the climatic system. 


1 The same root has the modern Greek word κρασί for wine. Yet the term is still in use today in Greek for 
derivative names related to climate such as εύκρατος (well-tempered, temperate) and ευκρασία (eucracy). 
2 [Aristot. Mete., 362b.17] «...ὅ τε γὰρ λόγος δείκνυσιν ὅτι ἐπὶ πλάτος μὲν [τὴν οἰκουμένην] ὥρισται, τὸ δὲ 
κύκλῳ συνάπτειν ἐνδέχεται διὰ τὴν κρᾶσιν, -οὐ γὰρ ὑπερβάλλει τὰ καύματα καὶ τὸ ψῦχος κατὰ μῆκος, ἀλλ' 
ἐπὶ πλάτος, ὥστ' εἰ μή που κωλύει θαλάττης πλῆθος, ἅπαν εἶναι πορεύσιμον, -καὶ κατὰ τὰ φαινόμενα περί 
τε τοὺς πλοῦς καὶ τὰς πορείας·» 

“... theoretical calculation shows that [inhabited Earth] is limited in breadth, but could as far as climate is 
concerned, extend round the Earth in a continuous belt; for it is not difference of longitude but of latitude that 
brings great variation of temperature, and if it were not for the ocean which prevents it, the 
complete circuit could be made. And the facts known to us from journeys by sea and land also confirm the 
conclusion...” (English translation by H.D.P. Lee, Harvard University Press, Cambridge, Mass. USA, 1952). 

3 In his Commentary on Aratus (Ἱππάρχου τῶν Ἀράτου καὶ Εὐδόξου Φαινομένων ἐξηγήσεως; Shcheglov, 
2007). 

4 [Strab. 1.1] «πάντες, ὅσοι τόπων ἰδιότητας λέγειν ἐπιχειροῦσιν, οἰκείως προσάπτονται καὶ τῶν οὐρανίων 
καὶ γεωμετρίας, σχήματα καὶ μεγέθη καὶ ἀποστήματα καὶ κλίματα δηλοῦντες καὶ θάλπη καὶ ψύχη καὶ ἁπλῶς 
τὴν τοῦ περιέχοντος φύσιν.» 

“Every one who undertakes to give an accurate description of a place, should be particular to add its 
astronomical and geometrical relations, explaining carefully its extent, distance, degrees of latitude, and 
‘climate’—the heat, cold, and temperature of the atmosphere.” (English translation by H.C. Hamilton, and W. 
Falconer, M.A., 1903) 

5 [Strab. 2.3] «αὕτη δὲ τῷ εἰς τὰς [πέντε] ζώνας μερισμῷ λαμβάνει τὴν οἰκείαν διάκρισιν· αἵ τε γὰρ 
κατεψυγμέναι δύο τὴν ἔλλειψιν τοῦ θάλπους ὑπαγορεύουσιν εἰς μίαν τοῦ περιέχοντος φύσιν συναγόμεναι, αἵ 
τε εὔκρατοι παραπλησίως εἰς μίαν τὴν μεσότητα ἄγονται, εἰς δὲ τὴν λοιπὴν ἡ λοιπὴ μία καὶ διακεκαυμένη.» 
“In the division into [five] zones, each of these is correctly distinguished. The two frigid zones indicate the want 
of heat, being alike in the temperature of their atmosphere; the temperate zones possess a moderate heat, and 
the remaining, or torrid zone, is remarkable for its excess of heat.” (English translation by H.C. Hamilton, and 
W. Falconer, M.A., 1903). Notice the use of the Aristotelian crasis (κρᾶσις) in the term εὔκρατοι (temperate) 
zones. 

6 The following definition appears in Moxon (1700): “Climate, From the Greek word Clima. of the same 
signification; it is a portion of the Earth or Heaven contained between two Parallels. And for distinction of 
Places, and different temperature of the Air, according to their situation; the whole Globe of Earth is divided 
into 24 Northern, and 24 Southern Climates, according to the half-hourly encreasing of the longest days; for 
under the Equator we call the first Climate: from thence as far as the Latitude extends, under which the longest 
day is half an hour more than under the Equator, viz. 12 hours and an half is the second Climate: where it is 
encreased a whole hour, the third Climate: and so each Northerly and Southerly Climate respectively hath its 
longest day half an hour longer than the former Climate, till in the last Climate North and South, the Sun Sets 
not for half a year together, but moves Circularly above the Horizon.” 


7 https://books.google.com/ngrams/graph?content=climatology. 


8 Thus Herbertson appears to be the father of the famous quotation “climate is what we expect, weather is 
what we get”, often attributed to Mark Twain. What Twain has actually written, attributing it to an 
anonymous student, is “Climate lasts all the time and weather only a few days”; see 
https://quoteinvestigator.com/2012/06/24/climate-vs-weather/. 


9 https://w1.weather.gov/glossary/index.php?letter=c 
10 https://www.cpc.ncep.noaa.gov/products/outreach/glossary.shtml#C 
11 http://glossary.ametsoc.org/wiki/Climate 


1.4 Temperature and its extremes as seen in a long record 


Next, we study temperature data of the same site, Bologna, Italy (coordinates same as in 
the GHCN station above), again one of the longest temperature records worldwide, which 
has been thoroughly studied for that reason. The time series of average daily temperature 
is available online in the frame of the European Climate Assessment & Dataset (ECAD; 
Klein Tank et al., 2002).* It is uninterrupted for the period 1814-2003, 190 years in total. 
For the most recent period, 2004-2018, daily data are provided by the online data 
repository Dext3r, described above.† With these additional data, the record length 
becomes 205 years. The analyses that follow were based on the ECAD 190-year data set, 
while the most recent data were used for validation purposes. Additional time series for 
earlier periods that go back to 1715 have been compiled and made available online by 
Camuffo et al. (2017a,b), but they were not used in this study except as background 
information. 

Figure 1.5 shows plots of the time series of daily temperature, along with moving 
averages and moving maxima and minima for a time window of 10 years (right-aligned), 
representing the 10-year climatic values. We may first observe that the temperature has 
varied from -13 to 34.2 °C, a range of 47.2 °C, which would be much higher than 50 °C if 
we also considered the diurnal variation. The minimum value of -13 °C occurred in 
January 1830 and the maximum of 34.2 °C in August 2017. This latter value is thus not 
contained in the ECAD time series, whose maximum is 33.8 °C, occurring in August 1947. 
If we focus on the 10-year climatic values we will see again change, which however is 
small compared to the 47.2 °C range. Specifically, the 10-year climatic average daily 
temperature has been changing between 12.6 °C (for the 10-year period ending in 1861) 


* Data retrieved on 2019-02-17 from https://climexp.knmi.nl/ecatemp.cgi?WMO=169. 


† In particular, the average daily temperature values of the station Bologna Urbana (44.500754°N, 
11.328789°E, 78.0 m) were used (note that no temperature data are provided for Bologna Idrografico, 
which was used for rainfall). The data at Bologna Urbana were adjusted by adding a constant temperature 
difference of 0.19 °C to become consistent with those of the ECAD station. To find this adjustment, as there 
is no common period of observation between the ECAD station and Bologna Urbana, a third station whose 
observations have common periods with both, namely the Bologna Meteo station (44.501223°N, 
11.328197°E, 80.0 m) was used. 
and 15.6 °C (for 2007). At the same time, the 10-year climatic value of the maximum daily 
temperature has varied between 29.6 °C (for 1904) and 34.2 °C (in 2016 or 33.8 °C in 
1947). Finally, the 10-year climatic value of the minimum daily temperature has varied 
between -13 °C (for 1830) and -2.4 °C (in 2007 or -3.8 °C in 1917). 
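
A right-aligned moving window of this kind can be sketched as follows. The function name `right_aligned_moving_stats` is hypothetical and the input is a tiny synthetic array, not the Bologna series; for daily data a 10-year window would correspond to roughly 3652 values, an assumption that ignores the exact handling of leap years.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def right_aligned_moving_stats(values, window):
    """Moving mean, maximum and minimum over the `window` most recent values.

    Position i summarises values[i - window + 1 : i + 1]; the first
    window - 1 positions have no complete window and are set to NaN.
    """
    values = np.asarray(values, dtype=float)
    if window < 1 or window > values.size:
        raise ValueError("window must be between 1 and len(values)")
    windows = sliding_window_view(values, window)  # shape (n - window + 1, window)
    pad = np.full(window - 1, np.nan)
    return {
        "mean": np.concatenate([pad, windows.mean(axis=1)]),
        "max": np.concatenate([pad, windows.max(axis=1)]),
        "min": np.concatenate([pad, windows.min(axis=1)]),
    }

# Tiny synthetic check (not Bologna data).
demo = right_aligned_moving_stats([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
```

Right alignment means that the value plotted at a given date depends only on data up to that date, which is why a climatic value is labelled by the year in which its window ends.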


[Figure 1.5: plot of temperature (°C) against year, 1800 to 2000]


Figure 1.5 Plot of the time series of daily temperature in Bologna, along with moving averages 
and moving maxima and minima for a time window of 10 years (right-aligned). The lines in darker 
colour represent the ECAD time series while those in lighter colour represent the data of the most 
recent years which are not included in the ECAD time series. 

As in precipitation, the climatic changes of temperature do not follow a linear pattern 
but have the form of long-term non-periodic fluctuations, up and down. After 1970 the 
trends are increasing for average, maximum and minimum temperatures, but such 
increasing trends were also observed in other periods (most prominently after 1900), 
lasting several decades and followed by drops thereafter. As shown in Figure 1.6, the 
recent trends for the 35-year period 1969-2003 are very intense. Interestingly, by 
examining graphs of mean annual temperature for earlier periods, before 1814, published 
in Camuffo et al. (2017a,b), we note that there was an equally (or even more) intense 
increasing trend between 1740 and 1780, preceded by an even more rapid decreasing 
trend from 1720 to 1740. Thus, the minimum temperature in the last 300 years was 
observed in 1740. 

However, if we follow the split-sample logic expounded in section 1.3, we will reject 
the linear-trend model. Even the visual information in Figure 1.6 suffices to realize its bad 
performance for the early period, as well as the more recent period, after 2003. 
Furthermore, Figure 1.7 tells the same story as in precipitation (section 1.3): The Hurst 
behaviour is evident, with a Hurst parameter H = 0.94 for the annual average temperature 
and H = 0.74 for the annual maximum daily temperature. 
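
The split-sample logic can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function name `split_sample_mse` is hypothetical, the mean square error is used as the performance measure (the measure actually used in section 1.3 may differ), and the series is a synthetic slow fluctuation rather than the Bologna data.

```python
import numpy as np

def split_sample_mse(t_cal, x_cal, t_val, x_val):
    """Calibrate two models on one period and score them on another.

    Linear-trend model : straight line fitted to the calibration period.
    Constant-mean model: mean of the calibration period.
    Returns the mean square error of each model on the validation period.
    """
    t_cal = np.asarray(t_cal, dtype=float)
    x_cal = np.asarray(x_cal, dtype=float)
    t_val = np.asarray(t_val, dtype=float)
    x_val = np.asarray(x_val, dtype=float)
    slope, intercept = np.polyfit(t_cal, x_cal, deg=1)
    trend_prediction = slope * t_val + intercept
    mean_prediction = np.full(x_val.shape, x_cal.mean())
    return {
        "linear_trend": float(np.mean((x_val - trend_prediction) ** 2)),
        "constant_mean": float(np.mean((x_val - mean_prediction) ** 2)),
    }

# Synthetic slow fluctuation: calibrate on its rising limb, validate on the rest.
t = np.arange(100.0)
x = np.sin(2.0 * np.pi * t / 100.0)
scores = split_sample_mse(t[:25], x[:25], t[25:], x[25:])
```

In this made-up case the trend fitted to the rising limb keeps rising in the validation period while the series turns downwards, so the constant-mean model scores the smaller error.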

Figure 1.6 Plots of annual average, maximum and minimum daily temperature in Bologna, with 
trends fitted on the most recent 35-year part of the ECAD time series representing the most 
warming period 1969-2003, for which the graphs are plotted with thicker lines. The newer data 
that are not included in the ECAD time series are plotted with dotted lines. 

Figure 1.7 Empirical and theoretical climacograms of annual indices of daily temperature at 
Bologna: (left) annual average; (right) annual maximum daily. 

1.5 A severe drought in a historical context 


Discussions about droughts have been intense in the 21st century, triggered by climate 
change fears as well as by the severity of some droughts that have occurred: in Australia 
(2001-09), California (2011-17; Griffin and Anchukaitis, 2014) and Europe (2003, 2015; 
Hanel et al., 2018). Nonetheless, even though the 21st-century droughts in Europe have 
been broadly regarded as exceptionally severe, the Hanel et al. (2018) study shows that 
they were much milder in severity and areal extent in comparison to many older extensive 
drought events in Europe. 

About a decade before these droughts, a prolonged and severe one hit Greece. It 
particularly influenced the Athens water supply system and shook society. Despite that, 
the resulting water crisis is not as famous as the current economic crisis in Greece. 
Certainly, the reason for not being famous is the very successful management of the water 
crisis, in contrast to the economic crisis. Indeed, the entire campaign to handle the 
drought in Athens was very successful and, despite the long (7-year) duration (1988-95) 
and severity of the drought, there was not a single day of system failure (cf. Koutsoyiannis,
2011a); all inhabitants had water in their tap all the time. 

Here we will study the hydrological conditions behind this water crisis using 
streamflow data for one of the three major catchments that supply water to Athens,
namely, the Boeoticos Kephisos River at the Karditsa station (close to the outlet to
Karditsa tunnel; catchment area 1930 km²). The monthly runoff time series we use
(compiled by Koutsoyiannis et al., 2007 and updated by Makropoulos et al., 2018 and
Efstratiadis et al., 2019) is the longest streamflow time series in Greece, beginning in
1907 and uninterrupted since then (112 years up to 2018-19; note that the convention of 
a hydrological year is used, from October of previous year to September of the current 
year). In contrast to floods whose study requires high temporal resolution data, the 
monthly time scale is more than sufficient for studying droughts. 

The 112-year monthly series of river discharge is shown in Figure 1.8, along with the 
10-year moving average (right-aligned; left panel), as well as a linear trend fitted to the 
latest 50-year period before the beginning of the drought, i.e., the period 1937-87. It is 
seen in the left panel of the figure that, after the drought period, the climatic value of
streamflow recovered (increased), but not to the level it had before the 1980s. The
trend model would predict that the falling trend would continue.

Comparison of the two models introduced in section 1.3, the linear-trend model and
the constant-mean model, is given in Table 1.2 for two validation periods, before and after
the calibration period. The constant-mean model performs better. Furthermore, if, in spite of
that, we preferred the trend model and planned for a period of, say, 50 years into the
future, we would have to think about what to do as we approach the end of the planning period,
because extrapolation of the trend gives negative streamflow at 2060, forty years from now.
This is similar to the early trend discussed in section 1.3, according to which the
probability dry in Bologna would become 1 just after 1850. Therefore, it is again better
not to trust the linear trend model. Later on (section 4.10), we will discuss how to make
better future predictions for a specified prediction horizon with the constant-mean model
along with Hurst-Kolmogorov dynamics.

Figure 1.8 Plots of the time series of monthly average discharge of Boeoticos Kephisos, with (left)
10-year moving averages (right-aligned) and (right) trend fitted to the period 1937-87 (the
50-year period before the beginning of the drought).
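The split-sample comparison of the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the series is synthetic (trend-free random values), not the Boeoticos Kephisos record, so the printed errors are not those of Table 1.2.

```python
import math
import random


def fit_line(t, x):
    """Ordinary least-squares slope and intercept of x against t."""
    n = len(t)
    t_mean, x_mean = sum(t) / n, sum(x) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, x))
    slope = sxy / sxx
    return slope, x_mean - slope * t_mean


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    squares = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squares) / len(squares))


def compare_models(years, values, cal_start, cal_end):
    """Fit both models on [cal_start, cal_end) and validate before and after it."""
    cal = [(t, x) for t, x in zip(years, values) if cal_start <= t < cal_end]
    cal_t, cal_x = zip(*cal)
    slope, intercept = fit_line(cal_t, cal_x)   # linear-trend model
    mean = sum(cal_x) / len(cal_x)              # constant-mean model
    periods = {
        "before": [(t, x) for t, x in zip(years, values) if t < cal_start],
        "after": [(t, x) for t, x in zip(years, values) if t >= cal_end],
    }
    result = {}
    for name, data in periods.items():
        t, x = zip(*data)
        result[name] = {
            "trend": rmse(x, [slope * ti + intercept for ti in t]),
            "mean": rmse(x, [mean] * len(x)),
        }
    return result


# Synthetic annual series (assumption: independent values around 12 m3/s).
random.seed(1)
years = list(range(1907, 2019))
flows = [12.0 + random.gauss(0.0, 5.0) for _ in years]
for period, errors in compare_models(years, flows, 1937, 1987).items():
    print(period, {model: round(e, 2) for model, e in errors.items()})
```

The point of the design is that both models are fitted on the calibration period only and judged on data they have not seen, which is what makes the comparison a validation rather than a goodness-of-fit exercise.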


Table 1.2 Root mean square errors (in m³/s) for the two validation periods for the linear-trend
model and the constant-mean model, fitted to the calibration period (1937-87).

Validation period          1907-37    1987-2019
Assuming linear trend      13.4       T2ef
Assuming constant mean     9.3        10.3

Figure 1.9 (left) Plot of the (right-aligned) moving average of the Boeoticos Kephisos discharge
for the time scales noted in the legend; the time locations of the observed minima at each scale
are also shown with dashed lines of the same colour as the corresponding moving-average time
series. (right) Close up of the left panel for years 1980-2000.


It is useful to study in more detail the drought period. In contrast to a flash flood, a 
drought is not a rapid event but its evolution usually extends over many years. To 
characterize that evolution stochastically, we may use a multi-scale representation of the
time series, as we did to define the climacogram. Figure 1.9 shows such a representation
at scales ranging from 1 to 10 years. The difference from the definition of the climacogram
is that the values plotted in Figure 1.9 are constructed for a sliding window of length equal
to the time scale, while in the standard definition of the climacogram the time windows 
are fixed in position. It is seen in the plots of the time series that the minima for all time 
scales for the entire period of observations are concentrated at that particular drought 
period. This is a characteristic of the HK behaviour; had the series been produced by a 
white noise model, that clustering would be quite improbable. 
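A minimal sketch of the sliding-window representation described above is given next. The function names and the toy series are assumptions made for illustration; the code only shows how right-aligned moving averages are formed at each scale and how the time location of the minimum at that scale is found.

```python
def moving_average(series, scale):
    """Right-aligned moving average: entry i averages the `scale` values
    ending at position i + scale - 1 of the original series."""
    window_sum = sum(series[:scale])
    averages = [window_sum / scale]
    for i in range(scale, len(series)):
        window_sum += series[i] - series[i - scale]   # slide the window by one
        averages.append(window_sum / scale)
    return averages


def minimum_location(series, scale):
    """Index (in the original series) of the last value of the window with
    the lowest average, together with that average."""
    averages = moving_average(series, scale)
    i_min = min(range(len(averages)), key=averages.__getitem__)
    return i_min + scale - 1, averages[i_min]


# Toy annual series (assumption): a run of low values around positions 3 to 5.
toy = [5, 4, 6, 1, 2, 1, 5, 6]
for scale in (1, 2, 3):
    print(scale, minimum_location(toy, scale))
```

In the toy series the minima at all three scales fall inside the same run of low values, which mimics, on a very small sample, the clustering of minima seen in Figure 1.9.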

Indeed, the climacogram plotted in Figure 1.10 suggests Hurst behaviour of the process
with Hurst parameter H = 0.82. Again, the difference from white noise is substantial. This
difference is further illustrated in the right panel of Figure 1.10, which shows the return
periods of the lowest and highest observed average discharge over time scales of 1 to 10
years.

Figure 1.10 (left) Empirical and theoretical climacograms of the Boeoticos Kephisos discharge
time series; (right) return periods of the lowest and highest observed average discharge over 
time scale 1 (annual scale) to 10 years (decadal scale) assuming normal distribution. 


The concept of return period will be discussed in detail in Chapter 5. For the current
discussion of our example, it suffices to say that, theoretically, the return period T of an
event, which has probability P to occur in a time interval D, is related to P and D by the
almost obvious relationship:

$$T = \frac{D}{P} \qquad (1.7)$$


If we consider the highest or the lowest value that has been observed in a time period
nD (where n is the sample size), then we can empirically assign to each of them a
probability P ≈ 1/n and thus T = nD. If we change the time interval D to κD then the
sample size of the observations becomes n/κ and again the empirical return period will
be T = nD. Thus, in our record of 112 years (n = 112, D = 1 year) the empirical return
period of the highest or the lowest observed value can for now be assumed to be 112
years, regardless of the time scale we consider. (A refinement of this technique will be
discussed in Chapter 5 and Chapter 6.)

That is about the empirical return period. Now let us make a model for the process
assuming normal marginal distribution with mean μ and standard deviation σ at time
scale 1 (year), and time dependence consistent with the HK model. The estimates of these
parameters for the HK model from the 110-year sample of annual values are μ = 11.69
m³/s, σ = 5.56 m³/s and H = 0.82. The method proposed in Koutsoyiannis (2003) was used
for this estimation. For scales κ > 1 the normal distribution is preserved and so is the
mean, while, according to equation (1.6), the standard deviation $\sigma(\kappa) = \sqrt{\gamma(\kappa)}$ will
decrease according to $\sigma(\kappa) = \sigma/\kappa^{1-H}$. Therefore, for each scale we can determine the
theoretical mean and standard deviation, find the theoretical probabilities of the highest
and lowest values $x_H$ and $x_L$, i.e., $P_H = P\{\underline{x} > x_H\}$ and $P_L = P\{\underline{x} < x_L\}$, respectively, from
the distribution function of the normal distribution, and determine the return period T
from equation (1.7). The results of this exercise are visually shown in the right panel of
Figure 1.10, where an agreement of theoretical and empirical distributions (T = 112
years) is observed. An underestimation of the theoretical return period of the lowest
values for time scales 1-3 years is attributed to the fact that the normal distribution is not
good enough for the lower distribution tail, as it is not bounded by 0, as it should be; this
deficiency ceases for larger scales, as the ratio σ/μ becomes smaller. All in all, the story
told by the graph for the case that we assumed the HK model is that, in whatever time
scale, the severe drought was as severe as expected for a 112-year period. Nothing more
severe than expected.
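The calculation just described can be sketched as follows. This is a hedged illustration, not the computation behind Figure 1.10: the function names are hypothetical, the low value of 6.0 m³/s is an invented example rather than an observed minimum, and the time interval D in equation (1.7) is assumed equal to the averaging scale κ (in years). The parameters μ, σ and H are those quoted in the text; H = 0.5 corresponds to independence in time.

```python
import math


def normal_cdf(x, mean, sd):
    """Distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def return_period_of_low(x_low, scale, mu, sigma, hurst):
    """Return period (years) of a scale-averaged value below x_low.

    sigma(kappa) = sigma / kappa**(1 - H); T = D / P with D = kappa years.
    """
    sd_at_scale = sigma / scale ** (1.0 - hurst)
    p_low = normal_cdf(x_low, mu, sd_at_scale)   # P{x < x_L} at this scale
    return scale / p_low


MU, SIGMA = 11.69, 5.56      # m3/s, annual scale (values quoted in the text)
X_LOW, SCALE = 6.0, 10       # hypothetical 10-year average discharge, m3/s

for hurst in (0.82, 0.5):
    t = return_period_of_low(X_LOW, SCALE, MU, SIGMA, hurst)
    print(f"H = {hurst}: return period of about {t:.0f} years")
```

With the same marginal parameters, lowering H from 0.82 to 0.5 shrinks the standard deviation at the 10-year scale and inflates the return period by roughly two orders of magnitude, which is the mechanism behind the contrast discussed in the text.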

Now let us assume that an expert on extremes, acting in 1995 (around the end of the
drought), was asked by water managers to assess the severity of the drought in terms of
its return period. Further, let us assume that our extremes expert was ignorant of the HK
behaviour and used classical statistics, as experts on extremes usually do. Apart from that, let
us assume that he adopted the same approach as above except for the HK behaviour, which
is equivalent to assuming H = 0.5. The expert at that time, based on the data and ignoring
the estimation bias, which is absent in classical statistics, would estimate for the annual
scale the mean as μ = 12.56 m³/s and the standard deviation as σ = 5.01 m³/s, which are
not very different from the estimates given before. However, by assuming independence
and going to larger time scales, the standard deviation will differ substantially and, as a
result, the return period will rise sharply. As seen in Figure 1.10, according to classical
statistics, for time scales > 6 years the return periods of the lowest values exceed 100 000
years! Even for the largest values, high return periods are estimated, of the order of
10 000 years. Thus, the extremes expert would conclude that something extraordinarily
extreme has happened, which requires an attribution study to most probably relate it to
anthropogenic global warming. Evidently, such attributions differ substantially from
similar ones in previous centuries. For example, after the great flood of the Arno River in
Florence in November 1333 (the first recorded, which killed more than 3 000 people), it
was chronicled by Giovanni Villani* that “the great debate in Florence was on whether the
flood occurred for God’s will or for natural causes.”

Therefore, it is the Hurst-Kolmogorov dynamics that characterizes the natural changes, 
restores the estimates of extremes to reality and enables a cool look at extremes and their 
uncertainty, which is useful, if not absolutely necessary, for their management. And 
indeed, the HK behaviour has been the theoretical (stochastic) backing of the modelling 
and the successful handling of the Athens drought episode. On the other hand, it is striking 
that the name “Hurst” does not even appear in recent publications related to drought 
episodes (some of which have already been cited). 


1.6 Maximum and minimum water level of the Nile 


The longest instrumental record in history is that of the water level of the Nile. The 
observations were taken at the Roda Nilometer, near Cairo (Figure 1.11). Toussoun 
(1925) published the annual minimum and maximum water levels at the Roda Nilometer 
from AD 622 to 1921. The measurements were made available on the internet 
(Koutsoyiannis, 2013b)†, also converting water levels into water depths assuming a
datum for the river bottom of 8.80 m. During 622-1470 (849 years), the record is almost 
uninterrupted but later there are large gaps. A few missing values of minima in the period 
before 1470 (namely, of the years 1285, 1297, 1303, 1310, 1319, 1363 and 1434) were 
filled in by Koutsoyiannis (2013b) using a simple method from Koutsoyiannis and 
Langousis (2011; p. 57), refined in Pappas et al. (2014). 

The annual minimum and maximum water levels of this period are plotted in Figure
1.12 along with their climatic values given as 30-year averages. Due to the large extent of
the Nile basin, the climatic fluctuation shown in the figure reflects the climate evolution
of a very large area in the tropics and subtropics. We may notice that in the 780s the
climatic (30-year) minimum value was 1.5 metres, while at AD 1110 and 1440 it was 4
metres, 2.5 times higher. In the lower panel of Figure 1.12 we can see a simulated series
from a roulette wheel which has the same variance as the minimum water depth Nilometer
series. Despite equal “annual” variability, the roulette wheel produces a static “climate”,
while the actual climate has varied substantially over time.
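The roulette-wheel series can be imitated with a short simulation. The sketch below is an assumption-laden illustration: the function names and the seed are arbitrary, the wheel is taken as fair with pockets 0 to 36, and each "annual" value is the minimum of m = 36 spins as stated in the caption of Figure 1.12. It does not reproduce the particular synthetic series plotted in the figure.

```python
import random


def roulette_minima(n_years, m=36, seed=0):
    """Synthetic 'annual' series: each value is the minimum of m roulette spins."""
    rng = random.Random(seed)
    return [min(rng.randint(0, 36) for _ in range(m)) for _ in range(n_years)]


def climatic_values(series, window=30):
    """Right-aligned moving averages of length `window` (the 'climate')."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


annual = roulette_minima(849)          # same length as the Nilometer minima
climate = climatic_values(annual)
print("range of annual values:", min(annual), "to", max(annual))
print("range of 30-year averages:", round(min(climate), 2), "to", round(max(climate), 2))
```

Because successive spins are independent, the 30-year averages stay within a narrow band around the long-term mean, which is the static "climate" referred to in the text.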

Comparing the two Nilometer series, we observe that the series of maximum water 
depths exhibits much smaller variability than that of the minimum depths. This seems 
counterintuitive at first glance but we should bear in mind that, while the minimum depth 
refers to water confined in the river banks, the maximum one refers to a wide area 
inundated by the Nile water during flooding. One may express doubts about the accuracy 
of the measurements and record keeping in that era, several centuries ago, particularly in 
view of some points in the graph that look like extraordinarily low or high outliers in each of
the time series. However, the data can be crosschecked in some instances by historical 
information. As an example, the maximum water level in AD 967 is registered as 6.78 m,
which is the second lowest maximum water depth, 1.74 m below the local climatic average
(see Figure 1.12). The historical information seems consistent with this extraordinarily
low value: it is known that in AD 967 about a quarter of Egypt's population (600 000
people) died of starvation and famine-related diseases (Fagan, 2008).

* Cronica, Tomo III, Libro XII, I; original text: “D’una grande questione fatta in Firenze se ‘l detto diluvio venne
per iudicio di Dio o per corso naturale ...”

† http://www.itia.ntua.gr/1351/.





Figure 1.11 The Roda Nilometer, near Cairo. Water entered through three tunnels and filled the 
Nilometer chamber up to river level. The measurements were taken on the marble octagonal 
column (with a Corinthian crown) standing in the centre of the chamber; the column is graded 
and divided into 19 cubits (each slightly more than 0.5 m) and could measure floods up to about 
9.2 m. A maximum level below the 16th mark could portend drought and famine and a level above
the 19th mark meant catastrophic flood (Photos by Loai Samen and Mohamd Mubarak; Google
maps, https://goo.gl/maps/T8NUgoDAorkz2 and https://goo.gl/maps/dsdJHJYVv572).


While both decreasing and increasing trends appear in both time series, with most
prominent the increasing trend in the series of maximum depths in the 14th and early 15th
century, their alternating and aperiodic character defies a deterministic description. On
the other hand, the stochastic description of the changes based on the HK dynamics is 
efficient. Indeed, Figure 1.13, which depicts the empirical and theoretical climacograms 
of the two Nilometer time series, shows that the natural changes are consistent with the 
HK behaviour. 

The great length of these time series enables the validation of the HK hypothesis for a
large range of time scales, from 1 to 84 (years). The difference from the popular white
noise model (slope -1) is striking, as is that from other popular models such as the
Markov model, which will be discussed in section 3.11. The Hurst parameters are high, H = 0.89
for the series of minima and H = 0.91 for the series of maxima. Similar H values have been
estimated from the contemporary, 131-year-long record of the Nile (naturalized)
flows at Aswan (Koutsoyiannis and Georgakakos, 2006). The most notable deviation between
the empirical behaviour and the HK model, shown in Figure 1.13, appears at scale 1 year
for the series of maxima. The difference corresponds to the occurrence of extraordinarily
high or low maxima in isolated years. And as discussed above, these occurrences have
been responsible for famines with thousands of lives lost.


In summary, the long Nilometer time series augments our confidence in the
applicability to hydroclimatic processes of the HK behaviour, which appeared in all our
examples. According to this behaviour:

• long-term changes are more frequent and intense than commonly perceived;
• these changes are irregular and aperiodic, appear as alternating trends that can
persist even for centuries, and are unpredictable per se;
• future states are much more uncertain and unpredictable on long time horizons
than implied by pure randomness.

Figure 1.12 (upper) Nile River annual minimum and maximum water depth at Roda Nilometer
(849 and 848 values, respectively, from Toussoun, 1925, as provided by Koutsoyiannis, 2013b).
(lower) Synthetic time series, each value of which is the minimum of m = 36 roulette wheel
outcomes; the value of m was chosen so that the standard deviation equals that of the minima of
Nilometer series (where the latter is expressed in metres). In all series the climatic values, given
as 30-year moving averages, are also plotted (right aligned).

Figure 1.13 Empirical and theoretical climacograms of the two Nilometer series: (left) minimum
and (right) maximum water depth; in the left graph the empirical climacogram of the roulette
wheel time series is also shown, which, as expected, is consistent with the white noise model.


Chapter 2. Basic concepts of probability with focus on extreme events 


2.1 Definition of probability 


For the proper understanding and use of probability, it is very important to insist on the 
definitions and clarification of its fundamental concepts. Such concepts may differ from 
other, more familiar, arithmetic and mathematical concepts, and this may cause confusion 
or even collapse of our cognitive construction, if we do not base it on solid foundations. 
For instance, in our everyday use of mathematics, we expect that all quantities are 
expressed by numbers and that the relationship between two quantities is expressed by 
the notion of a function, which to a numerical input quantity associates (maps) another 
numerical quantity, a unique output. Probability too makes such a mapping, but, instead 
of a number, the input quantity is an event, which mathematically can be represented as 
a set. Probability is then a quantified likelihood that the specific event will occur. This type 
of representation was proposed by Kolmogorov (1933). There are other probability 
systems different from Kolmogorov’s axiomatic system, according to which the input is 
not a set. Thus, in Jaynes (2003)* the input of the mapping is a logical proposition and 
probability is a quantification of the plausibility of the proposition. The two systems are 
conceptually different but the differences lie mainly in interpretation rather than in the
mathematical results. Here we will follow Kolmogorov’s system.


Table 2.1 Terminology correspondence in set theory and probability theory (adapted from
Kolmogorov, 1933)

Set theory | Events
A = ∅ | Event A is impossible
A = Ω | Event A is certain
AB = ∅ (or A ∩ B = ∅; disjoint sets) | Events A and B are incompatible (mutually exclusive)
AB...N = ∅ | Events A, B, ..., N are incompatible
X := AB...N | Event X is defined as the simultaneous occurrence of A, B, ..., N
X := A + B + ... + N (or X := A ∪ B ∪ ... ∪ N) | Event X is defined as the occurrence of at least one of the events A, B, ..., N
X := A − B | Event X is defined as the occurrence of A and, at the same time, the non-occurrence of B
Ā (the complementary of A) | The opposite event Ā consisting of the non-occurrence of A
B ⊂ A (B is a subset of A) | From the occurrence of event B follows the inevitable occurrence of event A





Kolmogorov’s approach to probability theory is based on the notion of measure, which
maps sets onto numbers. The objects of probability theory, the events, to which probability
is assigned, are thought of as sets. For instance, the outcome of a roulette spin, i.e. the
pocket of the wheel in which the ball eventually falls, is one of 37 pockets (in a European
roulette, numbered 0 to 36 and coloured black or red, except 0, which is coloured
green). Thus, all sets {0}, {1}, ..., {36} are events (also called elementary events). But they
are not the only ones. All possible subsets of Ω, including the empty set ∅, are events. The
set Ω := {0, 1, ..., 36} is an event too. Because any possible outcome is contained in Ω, the
event Ω occurs in any case and it is called the certain event. The sets ODD := {1, 3, 5, ..., 35},
EVEN := {2, 4, 6, ..., 36}, RED := {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34,
36}, and BLACK := Ω − RED − {0} are also events (in fact, betable). While events are
represented as sets, in probability theory there are certain differences from set theory in
terminology and interpretation, which are shown in Table 2.1.

* Jaynes’s book that we cite here was published posthumously; Jaynes died in 1998.

According to Kolmogorov’s (1933) axiomatization, probability theory is based on three
fundamental concepts and four axioms. The concepts form the triplet (Ω, Σ, P), called
probability space, where:

1. Ω is a non-empty set, which Kolmogorov calls the basic set (sometimes also called
sample space or the certain event), whose elements ω are called elementary events
(also known as outcomes or states).

2. Σ is a set known as σ-algebra or σ-field whose elements E are subsets of Ω, known
as events. Ω and ∅ are both members of Σ, and, in addition, (a) if E is in Σ then the
complement Ω − E is in Σ; (b) the union of countably many sets in Σ is also in Σ.

3. P is a function called probability that maps events (i.e., sets) to real numbers,
assigning to each event E (member of Σ) a number between 0 and 1.

The four axioms, which define the properties of P, are:

I. Non-negativity: For any event A, P(A) ≥ 0.
II. Normalization: P(Ω) = 1.
III. Additivity: For any incompatible events A and B (i.e., AB = ∅), P(A + B) = P(A) +
P(B).
IV. Continuity at zero: If $A_1 \supset A_2 \supset \ldots \supset A_n \supset \ldots$ is a decreasing sequence of events, with
$A_1 A_2 \ldots A_n \ldots = \emptyset$, then $\lim_{n \to \infty} P(A_n) = 0$.

We note that in the case that Σ is finite, axiom IV follows from axioms I-III; however, for
infinite fields it should be put as an independent axiom.
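For the finite roulette example, the probability space and the first three axioms can be checked directly in code. The sketch below represents events as Python sets and assumes, purely for illustration, a fair wheel in which all 37 pockets are equiprobable (the text so far does not assign probabilities); the names OMEGA, RED, BLACK, ODD, EVEN and prob are choices made here.

```python
from fractions import Fraction

OMEGA = frozenset(range(37))                      # basic set: pockets 0 to 36
RED = frozenset({1, 3, 5, 7, 9, 12, 14, 16, 18,
                 19, 21, 23, 25, 27, 30, 32, 34, 36})
BLACK = OMEGA - RED - {0}
ODD = frozenset(range(1, 36, 2))
EVEN = frozenset(range(2, 37, 2))


def prob(event):
    """Probability of an event (a subset of OMEGA) for equiprobable pockets."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("an event must be a subset of the basic set")
    return Fraction(len(event), len(OMEGA))


# Axiom I (non-negativity) and axiom II (normalization).
assert all(prob(e) >= 0 for e in (RED, BLACK, ODD, EVEN, frozenset()))
assert prob(OMEGA) == 1

# Axiom III (additivity) for the incompatible events RED and BLACK.
assert RED & BLACK == frozenset()
assert prob(RED | BLACK) == prob(RED) + prob(BLACK)

print("P(RED) =", prob(RED), " P(RED and ODD) =", prob(RED & ODD))
```

Exact fractions are used instead of floating-point numbers so that the additivity check is an exact equality rather than an approximate one.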


2.2 The concept of a stochastic variable


A stochastic variable or random variable* is a function that maps outcomes to numbers, i.e.
enumerates the basic set Ω. More formally, according to Kolmogorov’s (1933) definition,
a real single-valued function x(ω), defined on the basic set Ω, is called a random variable
if for each choice of a real number a the set {x(ω) < a} of all ω for which the inequality
x(ω) < a holds true belongs to Σ. With the concept of the stochastic variable, we can
conveniently express events using basic mathematics. In most cases enumeration is done
almost automatically. For instance, a stochastic variable that takes values 1 to 6 is
intuitively assumed when we deal with a die throw experiment.

* The two terms stochastic variable and random variable have identical meaning. Here we prefer the former,
even though the latter is more common.

We must be attentive that a stochastic variable is not a number but a function. 
Intuitively, we could think of a stochastic variable as an object that represents 
simultaneously all possible outcomes and only them. The following analogy may help us 
to develop intuition about stochastic variables. Let us consider the equation x³(x − 1)² =
0. This has five roots, three of them being x = 0 and two being x = 1. What do we mean
when we say “root of this equation”? Probably we mean both x = 0 and x = 1 and also we
have in mind that there is no symmetry between the two; rather we would give a weight of
3/5 to the former and 2/5 to the latter. Similar is the situation with a stochastic variable
which takes on the values 0 and 1 with probabilities 3/5 and 2/5, respectively.

While formally a stochastic variable is a function x(ω), we usually omit reference to its
argument ω and keep the symbol x. However, in this case we need to distinguish it
symbolically from a regular variable; the best notation devised to this aim and used here
is the so-called Dutch convention (see Hemelrijk, 1966, who mentions that it was
introduced by D. Van Dantzig in 1947, i.e., later than Kolmogorov’s foundation of
probability). According to it, stochastic variables are underlined, i.e. $\underline{x}$. In this case the
inequality {x(ω) < a} used for the formal definition of the stochastic variable is written
as $\{\underline{x} < a\}$. Accordingly, $\{\underline{x} < a\}$ denotes an event (a subset of Ω), and therefore it has a
probability, $P(\{\underline{x} < a\})$. For simplicity, in the latter notation we drop the parentheses and
we write $P\{\underline{x} < a\}$. Some texts drop the curly brackets instead of the parentheses, but this
practice misrepresents the important fact that the argument of probability is a set. This
notation is further explained in Digression 2.A, along with its importance.

From a practical point of view, compared to a regular variable, a stochastic variable is 
a more abstract mathematical entity, which we use when a quantity of interest is 
something uncertain, unpredictable, unknown; this is the meaning of stochastic and 
random (cf. Koutsoyiannis, 2010; Dimitriadis et al., 2016). While a regular variable takes 
on one value at a time, a stochastic variable can be thought of as taking on all of its possible
values at once, but not necessarily in a uniform manner; therefore, a probability 
distribution function, defined in section 2.3, should always be associated with a stochastic 
variable. A stochastic variable becomes identical to a regular variable only if it can take on 
only one value. 

When an observation of a quantity that is modelled as a stochastic variable is made,
then this observation is usually a regular variable. For example, we model a die throw
with a stochastic variable $\underline{x}$ with possible values 1 to 6. After a specific throw of the die
and before we observe the outcome, we still have the same uncertainty as described by
the stochastic variable $\underline{x}$. When we observe the outcome, it becomes a regular variable x (e.g.
x = 5). The particular value is called a realization of $\underline{x}$ and is denoted by the
non-underlined symbol x. This happens when our observation is exact. Sometimes the
observation is contaminated by error; our observations are not always exact
(particularly those of real-valued variables). Then we can use another stochastic variable
to describe the uncertain outcome. For example, if an observer has presbyopia combined
with astigmatism (like the author) he may not be sure whether the outcome was 5 or 4
and he could model it as a stochastic variable $\underline{z}$ with possible outcomes 4 and 5.

Considering a certain (deterministic) function y = g(x), mapping the regular variable
x to the regular variable y (e.g. y = g(x) = x²), we can extend its meaning to apply to
stochastic variables, i.e., $\underline{y} = g(\underline{x})$ (e.g. $\underline{y} = g(\underline{x}) = \underline{x}^2$). As implied by the notation, when
the function’s argument $\underline{x}$ is a stochastic variable, the result $\underline{y}$ is also a stochastic variable
(formally, it is the composite function y(ω) = g(x(ω))). In other words, functions of
stochastic variables are stochastic variables.


Digression 2.A: The importance of notation 


The following simple example shows that the common practice of not distinguishing the notation
of regular and stochastic variables is a bad practice. Let $\underline{x}$ and $\underline{y}$ represent the outcomes of each
of two dice. What is the probability of the following cases?

(a) $\{\underline{x} < \underline{y}\}$, (b) $\{\underline{x} < y\}$, (c) $\{x < \underline{y}\}$, (d) $\{x < y\}$.

(a) There are 6² = 36 different possible combinations of outcomes of $\underline{x}$ and $\underline{y}$. In six of them $\underline{x} = \underline{y}$.
Due to symmetry, in half of the remaining 30, $\underline{x} < \underline{y}$. Thus:

$$P\{\underline{x} < \underline{y}\} = \frac{15}{36} = \frac{5}{12}$$

(b) Now y is a number, not a stochastic variable. For convenience we assume that y is integer, even
though it can also be assumed to be real. If y > 6 then obviously the event $\{\underline{x} < y\}$ is certain. If y =
6 then the probability of $\{\underline{x} < y\}$ is 5/6. Continuing like this we conclude that:

$$P\{\underline{x} < y\} = \max\left(0, \min\left(1, \frac{y - 1}{6}\right)\right)$$

(c) Thinking as in (b) and noting that x is a number, assumed integer, and $\underline{y}$ a stochastic variable
we find that:

$$P\{x < \underline{y}\} = \max\left(0, \min\left(1, 1 - \frac{x}{6}\right)\right)$$

(d) As both x and y are numbers the expression {x < y} does not denote an event and therefore,
strictly, there is no probability associated with this expression. Loosely we may say that
P{x < y} = 1 if x < y and 0 otherwise.

Obviously, if we did not distinguish $\underline{y}$ from y, we would not even be aware of the fact that
$P\{\underline{x} < \underline{y}\}$ is a number while $P\{\underline{x} < y\}$ is a function of y.
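The three probabilities of cases (a) to (c) can be verified by brute-force enumeration. The sketch below assumes two fair dice and integer arguments, as in the digression; the function names are hypothetical and only serve to keep the stochastic and the regular variable apart in the code.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)   # possible outcomes of a fair die


def p_both_stochastic():
    """Case (a): both dice are stochastic variables."""
    pairs = list(product(FACES, FACES))
    return Fraction(sum(1 for a, b in pairs if a < b), len(pairs))


def p_first_stochastic(y):
    """Case (b): first die stochastic, y a regular (integer) number."""
    return Fraction(sum(1 for a in FACES if a < y), 6)


def p_second_stochastic(x):
    """Case (c): x a regular (integer) number, second die stochastic."""
    return Fraction(sum(1 for b in FACES if x < b), 6)


print("(a)", p_both_stochastic())
print("(b)", [str(p_first_stochastic(y)) for y in range(0, 9)])
print("(c)", [str(p_second_stochastic(x)) for x in range(0, 9)])

# Comparison with the closed-form expressions of cases (b) and (c).
for v in range(-3, 11):
    assert p_first_stochastic(v) == max(0, min(1, Fraction(v - 1, 6)))
    assert p_second_stochastic(v) == max(0, min(1, 1 - Fraction(v, 6)))
```

Case (a) returns a single number, whereas cases (b) and (c) return one value per argument, which is the distinction the digression draws between a number and a function.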


Many texts (research articles and probability theory books) make the notational distinction between
random and regular variables, but they use upper-case letters for stochastic variables and lower-case
ones for regular variables. This practice may also be inadequate. If in our context we used
another quantity denoted with the Greek letter χ (and actually χ is quite common in statistical
texts; cf. the chi and chi-squared distributions), how would we distinguish the stochastic
variables corresponding to x and χ? (In both cases the upper-case letter is X, while in our
convention $\underline{x}$ and $\underline{\chi}$ are distinguishable.) Furthermore, this would be too restrictive in our use of
mathematical symbols. For example, the symbol H used in Chapter 1 (and many other chapters)
to denote the Hurst parameter would be an incorrect notation if we adopted the upper- vs.
lower-case notation. Another convention was used by Papoulis (1990, 1991) who denoted stochastic
variables in bold letters. However, the typical use of bold letters is to denote vectors. Therefore,
the Dutch convention of underlining the stochastic variables is the most convenient, clearest and
safest.


2.3 Distribution function 


According to Kolmogorov’s (1933) foundation* of probability theory, the function of the
real variable x,

$$F(x) = P\{\underline{x} \le x\} \qquad (2.1)$$

where $\underline{x}$ is a stochastic variable, is called the distribution function. We notice that the
stochastic variable with which this function is associated is not an argument of the
function. Even though we use the same letter for both $\underline{x}$ and x, the two are fundamentally
different. For example, in a die throw, the stochastic variable $\underline{x}$ represents the whole
numbers 1 to 6 and the regular variable x takes on any real value from −∞ to +∞. (The
domain of F(x) is not identical to the range of the stochastic variable $\underline{x}$; rather it is always
the set of real numbers.) If there is risk of confusion (e.g., if we study a problem with many
stochastic variables), the stochastic variable should also appear in the notation of the
distribution function. Usually, it is denoted as a subscript, $F_{\underline{x}}(x)$, but we can simplify the
notation by dropping the underline in the subscript ($F_x(x)$) once we adopt the convention
that the subscripts are stochastic variables.

Typically, F(x) has a mathematical expression depending on some parameters. It is a
non-decreasing function of x obeying the relationship:

$$0 = F(-\infty) \le F(x) \le F(+\infty) = 1 \qquad (2.2)$$

Because of its non-decreasing behaviour, in the English literature F(x) is also known as the cumulative
distribution function, but here we adhere to Kolmogorov’s (1933) original terminology,
which did not contain the adjective cumulative. In practical applications the distribution
function is also known as non-exceedance probability. Likewise, the non-increasing
function:

$$\overline{F}(x) = P\{\underline{x} > x\} = 1 - F(x) \qquad (2.3)$$

is known as the tail function (or survival function, survivor function) and represents the
exceedance probability.

The distribution function is always continuous on the right; however, if the basic set Ω
is finite or countable, F(x) is discontinuous on the left at all points $x_j$ that correspond to
outcomes $\omega_j$, and it is constant between them (staircase-like). Such a stochastic variable is
called discrete. If F(x) is a continuous function, then the stochastic variable is called
continuous. A mixed case is also common; in this the distribution function has some
discontinuities on the left, but is not staircase-like.
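The staircase shape of a discrete distribution function can be made concrete with the die example. The sketch below assumes a fair die (equal probability 1/6 for each face); the function names die_cdf and die_tail are chosen here for illustration. It evaluates the distribution function and the tail function for any real argument, not only for the values 1 to 6.

```python
import math
from fractions import Fraction


def die_cdf(x):
    """Distribution function of a fair die, defined for every real x:
    the probability that the outcome does not exceed x."""
    faces_not_exceeding = min(6, max(0, math.floor(x)))
    return Fraction(faces_not_exceeding, 6)


def die_tail(x):
    """Tail function (exceedance probability), 1 minus the distribution function."""
    return 1 - die_cdf(x)


for x in (-1.0, 0.999, 1, 2.5, 3, 5.999, 6, 10):
    print(f"x = {x}: F = {die_cdf(x)}, tail = {die_tail(x)}")

# Continuity on the right and a jump on the left at the outcome x = 3.
assert die_cdf(3) == die_cdf(3 + 1e-9)
assert die_cdf(3 - 1e-9) < die_cdf(3)
```

The jump at each integer from 1 to 6 equals 1/6, the probability of the corresponding outcome, and the function is constant between successive integers.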

For continuous stochastic variables, the inverse function $F^{-1}(\cdot)$ of $F(\cdot)$ exists. Consequently, the equation $u = F(x)$ has a unique solution for $x$, called the $u$-quantile of the variable $\underline{x}$, that is:

$$x_u = F^{-1}(u) \qquad (2.4)$$

* We note that Kolmogorov used ‘<’ in his definition but modern literature uses ‘≤’ as in (2.1).


2.4 Probability mass and density function 
In discrete stochastic variables, the probability of each event:

$$P_j := P(x_j) := P\{\underline{x} = x_j\}, \quad j = 1, \dots, J \qquad (2.5)$$

where J is the number of possible outcomes (which can be infinite), is the probability mass function. It is then easy to see that the step (discontinuity) of the distribution function F(x) at point $x_j$ equals $P_j$.

In continuous variables there are no discontinuities and hence any particular value x 
has zero probability to occur. However, we can still tell which of two outcomes is more 
probable and by how much by examining the ratio of the two probabilities. As this is a 0/0 
expression, having in mind l’Hôpital’s rule, we need to examine the ratio of derivatives of
probabilities. 

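The derivative of the distribution function is called the probability density function (PDF) or simply density:

$$f(x) := \frac{\mathrm{d}F(x)}{\mathrm{d}x} \qquad (2.6)$$

and its basic properties are:

$$f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1 \qquad (2.7)$$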


Obviously, the probability density function does not represent a probability; therefore, it can take on values higher than 1. Its relationship with probability is described by the following equation:

$$f(x) = \lim_{\Delta x \to 0} \frac{P\{x \le \underline{x} \le x + \Delta x\}}{\Delta x} \qquad (2.8)$$

The distribution function can be calculated from the density function by:

$$F(x) = \int_{-\infty}^{x} f(y)\,\mathrm{d}y \qquad (2.9)$$
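
The relationships (2.8) and (2.9) can be checked numerically. The sketch below assumes, purely for illustration, an exponential distribution with mean 0.5; the names `MU`, `f`, `F` and `integrate` are hypothetical, and the midpoint rule is just one convenient way of approximating the integral.

```python
import math

MU = 0.5  # assumed mean of an exponential distribution (illustrative value)

def f(x: float) -> float:
    """Density f(x) = exp(-x/mu)/mu for x >= 0, zero otherwise."""
    return math.exp(-x / MU) / MU if x >= 0 else 0.0

def F(x: float) -> float:
    """Distribution function F(x) = 1 - exp(-x/mu) for x >= 0."""
    return 1.0 - math.exp(-x / MU) if x >= 0 else 0.0

def integrate(func, a: float, b: float, n: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

print(f(0.0))                               # a density value above 1
print(integrate(f, 0.0, 1.0), F(1.0))       # equation (2.9), numerically
dx = 1e-6
print((F(1.0 + dx) - F(1.0)) / dx, f(1.0))  # equation (2.8), small interval
```

Since the density is zero for negative x, integrating from 0 is equivalent to integrating from $-\infty$ in (2.9).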


In discrete stochastic variables, the density is a sequence of Dirac δ functions (see the definition of δ in equation (3.51)), while in mixed distributions Dirac δ functions appear at the points of discontinuity. This text mostly deals with continuous variables, but mixed-type variables appear in several cases, as will be discussed in Chapter 6 and Chapter 8.

Some of the most common distributions of discrete and continuous variables are 
shown in Table 2.2. Additional continuous distributions are shown in Table 2.3, along with 
their moments, while the derivation of these and other distributions in terms of the 
principle of maximum entropy is indicated in Table 2.4 and Table 2.5. 

As already discussed (section 2.2), the one-to-one mathematical transformation on $\underline{x}$, $\underline{y} = g(\underline{x})$, defines a new stochastic variable $\underline{y}$. If the function g(x) is invertible (and, as assumed here, increasing), then the event $\{\underline{y} \le y\}$ is identical to the event $\{\underline{x} \le g^{-1}(y)\}$, where $g^{-1}$ is the inverse function of g. Consequently, the distribution functions of $\underline{x}$ and $\underline{y}$ are related by:

$$F_y(y) = P\{\underline{y} \le y\} = P\{\underline{x} \le g^{-1}(y)\} = F_x(g^{-1}(y)) \qquad (2.10)$$

In the case that the variables are continuous and the function g differentiable, it can be shown that the density functions of $\underline{x}$ and $\underline{y}$ are related by:

$$f_y(y) = \frac{f_x(g^{-1}(y))}{|g'(g^{-1}(y))|} \qquad (2.11)$$

where $g'$ is the derivative of g.
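
As a hedged illustration of (2.10) and (2.11), the sketch below takes $\underline{x}$ to be standard normal and the transformation $g(x) = e^x$ (both are assumptions made only for this example; all function names are hypothetical). It compares the density given by (2.11) with a numerical derivative of the distribution function given by (2.10).

```python
import math

def F_x(x: float) -> float:
    """Distribution function of the standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_x(x: float) -> float:
    """Density of the standard normal variable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Assumed increasing transformation y = g(x) = exp(x), its inverse and derivative.
g_inv = math.log
g_prime = math.exp

def F_y(y: float) -> float:
    """Equation (2.10): F_y(y) = F_x(g^{-1}(y)), for y > 0."""
    return F_x(g_inv(y))

def f_y(y: float) -> float:
    """Equation (2.11): f_y(y) = f_x(g^{-1}(y)) / |g'(g^{-1}(y))|, for y > 0."""
    x = g_inv(y)
    return f_x(x) / abs(g_prime(x))

y, h = 2.0, 1e-5
numerical_density = (F_y(y + h) - F_y(y - h)) / (2.0 * h)
print(numerical_density, f_y(y))  # the two values should nearly coincide
```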


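Table 2.2 Some of the simplest and most common distributions.

Name (and parameters) | Probability mass function or probability density function | Probability distribution function

Discrete variable $\underline{x}$ with values $x_j$

Discrete uniform, $x_j = 1, \dots, J$ | $P(x_j) = 1/J$ | $F(x) = \max(0, \min(\lfloor x \rfloor / J, 1))$
Geometric, $x_j = 0, 1, \dots$ ($\mu > 0$) | $P(x_j) = \frac{1}{1+\mu}\left(\frac{\mu}{1+\mu}\right)^{x_j}$ | $F(x) = \max\left(0, 1 - \left(\frac{\mu}{1+\mu}\right)^{\lfloor x \rfloor + 1}\right)$

Continuous variable $\underline{x}$

Uniform in [0, J] | $f(x) = 1/J$ for $0 \le x \le J$, 0 otherwise | $F(x) = \max(0, \min(x/J, 1))$
Exponential ($\mu > 0$), $x \ge 0$ | $f(x) = e^{-x/\mu}/\mu$ | $F(x) = 1 - e^{-x/\mu}$
Normal ($\mu \in \mathbb{R}$, $\sigma > 0$) | $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $F(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} \exp\left(-\frac{(u-\mu)^2}{2\sigma^2}\right) \mathrm{d}u$

Note: $\lfloor x \rfloor$ denotes the floor of the number x (the greatest integer less than or equal to x).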


Digression 2.B: Illustration of distribution function by an example 


For clarification of the basic concepts of probability theory, we give the following example of 
hydroclimatic interest. In particular we study (a) the occurrence of rainfall at a particular site and 
a specific time of the year, and (b) the rainfall depth at that site and time. 

In (a) we are interested in the mathematical description of the possibilities that a certain day at the specified site and time is wet or dry. These are the outcomes or states of our problem, so the basic set is:

$$\Omega = \{\text{wet}, \text{dry}\}$$

The field $\Sigma$ contains all possible events, i.e.:

$$\Sigma = \{\emptyset, \{\text{wet}\}, \{\text{dry}\}, \Omega\}$$

To fully define probability on $\Sigma$ it suffices to define the probability of one of the two states, say P{wet}. In fact, this is not easy: usually it is done by induction, and it needs a set of observations to be available and concepts of statistics (see Chapter 4) to be applied. For the time being let us arbitrarily assume that P{wet} = 0.2. The remaining probabilities are obtained by applying the axioms. Clearly, $P(\Omega) = 1$ and $P(\emptyset) = 0$. Since wet and dry are incompatible, P{wet} + P{dry} = P({wet} + {dry}) = $P(\Omega)$ = 1, so P{dry} = 0.8.

We define a stochastic variable $\underline{x}$ based on the rule

$$\underline{x}(\text{dry}) = 0, \qquad \underline{x}(\text{wet}) = 1$$

We can now easily determine the distribution function of $\underline{x}$. For any x < 0,

$$F(x) = P\{\underline{x} \le x\} = 0$$

(because $\underline{x}$ cannot take negative values). For 0 ≤ x < 1,

$$F(x) = P\{\underline{x} \le x\} = P\{\underline{x} = 0\} = 0.8$$

Finally, for x ≥ 1,

$$F(x) = P\{\underline{x} \le x\} = P\{\underline{x} = 0\} + P\{\underline{x} = 1\} = 1$$


The graphical depiction of the distribution function is shown in Figure 2.1 (left). The staircase-like shape reflects the fact that the stochastic variable is discrete.

Figure 2.1 Distribution function of a stochastic variable representing events related to rainfall of a given day at a certain area and time of the year: (left) the dry or wet state; (right) the rainfall depth.


In (b) the state is described by the rainfall depth, which can be zero or positive. Therefore, the basic set is the set $\mathbb{R}^+ \cup \{0\}$. The stochastic variable $\underline{x}$ is given by the rule $\underline{x}(\omega) = \omega$. Again, the distribution function of $\underline{x}$ will be $F(x) = P\{\underline{x} \le x\} = 0$ for x < 0, with a discontinuity at 0, so that $F(0) = P\{\underline{x} = 0\} = 0.8$. For x > 0 the distribution function will be continuous and increasing, approaching 1 as $x \to \infty$. To construct a plausible distribution function, without examining observations, we make the assumption that smaller values are more probable than higher ones and specifically that for two values $x_1$ and $x_2 > x_1$, the ratio of densities (expressing the ratio of probabilities according to l’Hôpital’s rule) depends on the difference $x_2 - x_1$, i.e.,

$$\frac{f(x_1)}{f(x_2)} = g(x_2 - x_1)$$

where it is easy to see that the function $g(\cdot)$ should be given as $g(x) = f(0)/f(x)$. In turn, it can be shown (homework) that $f(x) = A \exp(-Bx)$, where A and B are constants. By integrating (according to equation (2.9)) we find:

$$F(x) = \frac{A}{B}\,(1 - \exp(-Bx)) + C$$

and, since $F(0^+) = 0.8$ and $F(\infty) = 1$, C = 0.8 and A/B = 0.2, thus:

$$F(x) = 0.2\,(1 - \exp(-Bx)) + 0.8$$

where B can be any positive number. An example is depicted in Figure 2.1 (right) for B = 1. The result is a modified exponential distribution (see Table 2.2), where the modification resulted from the fact that the distribution is not continuous everywhere but mixed.
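
The mixed distribution function of case (b) can be sketched in Python as follows. The constants follow the assumptions made above (P{dry} = 0.8, B = 1); the names `P_DRY`, `B`, `F` and `sample_depth` are hypothetical, and the sampling step inverts F in the way that is introduced later in section 2.6.

```python
import math
import random

P_DRY = 0.8  # assumed probability of a dry day, as in the example above
B = 1.0      # assumed rate constant, the value used for Figure 2.1 (right)

def F(x: float) -> float:
    """Mixed distribution function: a jump of 0.8 at x = 0, continuous for x > 0."""
    if x < 0:
        return 0.0
    return P_DRY + (1.0 - P_DRY) * (1.0 - math.exp(-B * x))

def sample_depth(rng: random.Random) -> float:
    """Draw one rainfall depth by inverting F at a uniform random number u."""
    u = rng.random()  # uniform in [0, 1)
    if u <= P_DRY:
        return 0.0    # u falls within the jump at zero: a dry day
    # Solve u = F(x) for x > 0.
    return -math.log((1.0 - u) / (1.0 - P_DRY)) / B

rng = random.Random(1)
depths = [sample_depth(rng) for _ in range(20_000)]
print(sum(d == 0.0 for d in depths) / len(depths))  # close to 0.8
print(sum(depths) / len(depths))                    # close to 0.2 / B
```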

If this mathematical model is to represent a physical phenomenon, we must keep in mind that 
all probabilities depend on a specific location and a specific time of the year. So, the model cannot 
be a global representation of the wet and dry state of a day, nor of the rainfall depth. The model 
as formulated here is extremely simplified. It does not make any reference to the succession of 
dry or wet states in different days. This is not an error; it simply diminishes the predictive capacity 
of the model. A better model would describe separately the probability of a wet day following a 
wet day, a wet day following a dry day (we anticipate that the latter should be smaller than the 
former), etc. In addition, while the assumption on the rainfall depth leading to a mixed exponential 
distribution seems plausible at first glance, it does not fully correspond to the empirically
observed behaviour. There are better models than the exponential. We will discuss these issues 
in subsequent sections. 


2.5 Conditional probability, independent and dependent events 


By definition (Kolmogorov, 1933), the conditional probability of the event A given B (i.e. under the condition that the event B has occurred) is the quotient:

$$P(A|B) := \frac{P(AB)}{P(B)} \qquad (2.12)$$

Obviously, if P(B) = 0, this conditional probability cannot be defined. It follows that:

$$P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A) \qquad (2.13)$$

From this it follows that:

$$P(B|A) = P(B)\,\frac{P(A|B)}{P(A)} \qquad (2.14)$$

Equation (2.14) is known as the Bayes theorem.

If it happens that P(A|B) = P(A), i.e., the probability of A does not depend on whether 
or not B has occurred, then the events A and B are called (stochastically) independent. In 
this case from equation (2.12) it follows that: 


P(AB) = P(A)P(B) (2.15) 


Otherwise, A and B are called (stochastically) dependent. 
The definition can be extended to many events. Thus, the events $A_1, A_2, \dots$ are independent (or mutually independent) if for any finite set of distinct indices $i_1, i_2, \dots, i_n$:

$$P(A_{i_1} A_{i_2} \cdots A_{i_n}) = P(A_{i_1})\,P(A_{i_2}) \cdots P(A_{i_n}) \qquad (2.16)$$


Thus, handling probabilities of independent events is easy. However, this is a special case 
because usually natural events are dependent. In handling dependent events the notion 
of conditional probability is vital. 

It is easy to show that the generalization of (2.16) for dependent events takes the forms:

$$P(A_n \cdots A_1) = P(A_n|A_{n-1} \cdots A_1) \cdots P(A_2|A_1)\,P(A_1) \qquad (2.17)$$

$$P(A_n \cdots A_1|B) = P(A_n|A_{n-1} \cdots A_1 B) \cdots P(A_2|A_1 B)\,P(A_1|B) \qquad (2.18)$$

which are known as the chain rules. It can also be proved (homework) that if A and B are mutually exclusive, then

$$P(A + B|C) = P(A|C) + P(B|C) \qquad (2.19)$$

$$P(C|A + B) = \frac{P(C|A)\,P(A) + P(C|B)\,P(B)}{P(A) + P(B)} \qquad (2.20)$$

and if $A + B = \Omega$, so that P(A + B) = 1, then

$$P(C) = P(C|A)\,P(A) + P(C|B)\,P(B) \qquad (2.21)$$


Digression 2.C: An example on the dependence of probability on 
information 


We assume that, at a certain place on Earth (say, in a city in the United Kingdom) and a certain 
period of the year, a dry and a wet day are equiprobable and that in the different days the states 
(wet or dry) are independent. What is the probability that two consecutive days are wet under 
the following conditions? (a) Unconditionally. (b) If we know that the first day is wet. (c) If we 
know that the second day is wet. (d) If we know that one of the two days is wet. 


We denote A := {first day wet}, $\overline{A}$ := {first day dry}, B := {second day wet}, $\overline{B}$ := {second day dry}. The basic set is $\{AB, A\overline{B}, \overline{A}B, \overline{A}\,\overline{B}\}$.

(a) We seek to find P(AB). Obviously, given the independence assumption, P(AB) = P(A)P(B) = (1/2)² = 1/4. Because of equiprobability and independence, each of the four events has probability 1/4.

(b) Now the probability sought is P(AB|A). Using the chain rule in equation (2.18) we find P(AB|A) = P(A|AB) P(B|A) = 1 × 1/2 = 1/2.

(c) Like in (b), we find P(AB|B) = 1/2.

(d) The condition that one of the two days is wet is the composite event $AB + A\overline{B} + \overline{A}B$. Thus, the probability sought is

$$P(AB \,|\, AB + A\overline{B} + \overline{A}B) = \frac{P(AB\,(AB + A\overline{B} + \overline{A}B))}{P(AB + A\overline{B} + \overline{A}B)} = \frac{P(AB)}{P(AB + A\overline{B} + \overline{A}B)} = \frac{1/4}{3/4} = \frac{1}{3}$$

where we have used the definition of conditional probability and the fact that $AB$, $A\overline{B}$, $\overline{A}B$ are mutually exclusive.
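
The four answers can also be obtained by brute-force enumeration of the basic set, applying the definition (2.12) directly. The following is a minimal sketch; the helper names `prob` and `cond` and the event predicates are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Basic set: (state of first day, state of second day), each with probability 1/4.
outcomes = {pair: Fraction(1, 4) for pair in product(("wet", "dry"), repeat=2)}

def prob(event) -> Fraction:
    """Probability of an event given as a predicate on outcomes."""
    return sum(p for w, p in outcomes.items() if event(w))

def cond(event, given) -> Fraction:
    """Conditional probability P(event | given), equation (2.12)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

both_wet = lambda w: w == ("wet", "wet")
first_wet = lambda w: w[0] == "wet"
second_wet = lambda w: w[1] == "wet"
any_wet = lambda w: "wet" in w

print(prob(both_wet))              # (a) 1/4
print(cond(both_wet, first_wet))   # (b) 1/2
print(cond(both_wet, second_wet))  # (c) 1/2
print(cond(both_wet, any_wet))     # (d) 1/3
```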

To connect the example to the real world, let us assume that a friend travelled to this city for a specified couple of days. If we do not have any information except the specific dates, then to the event that she used her umbrella on both days we will assign probability 1/4. If we have seen (e.g. in her social media posts) a photo showing her in the city holding an umbrella, then to the same event we will assign probability 1/3. If, in addition, the photo has a time stamp on it, then we will change the probability to 1/2. In other words, the information we have in a problem may introduce dependences in events that are initially assumed independent. More generally, the probability is not an invariant quantity, characteristic of physical reality in absolute terms, but a quantity that depends on our knowledge or information on the examined phenomenon. It may sound paradoxical that the probability depends on information, but it is not. The rules by which we assign probabilities are objective and theoretically consistent: there was nothing ambiguous in calculating the above probabilities, based on the information given each time. We may additionally recall that even in classical deterministic physics we are dealing with similar situations. For instance, the location and velocity of a moving particle are not absolute objective quantities. If we change the coordinate system, the numerical values of the coordinates and the velocity will also change.


Digression 2.D: An example on dependent events


The independence assumption in the problem in Digression 2.C is obviously a poor representation of the physical reality. To construct a slightly more realistic model, let us assume that the probability of today being wet (B) or dry ($\overline{B}$) depends on the previous day’s state (A or $\overline{A}$). It is reasonable to assume that the following inequalities hold:

$$P(B|A) > P(B) = 0.5, \qquad P(\overline{B}|\overline{A}) > P(\overline{B}) = 0.5$$


Now, the problem becomes more complicated than before. Let us arbitrarily assume that P(B|A) = 0.6. Then the probability that both days are wet is P(AB) = P(B|A)P(A) = 0.6 × 0.5 = 0.3 > 1/4. For the sake of completeness, we also calculate the probabilities of the other combinations. From (2.21), we get $P(B) = P(B|A)P(A) + P(B|\overline{A})P(\overline{A})$, from which we find:

$$\begin{bmatrix} P(B|A) & P(B|\overline{A}) \\ P(\overline{B}|A) & P(\overline{B}|\overline{A}) \end{bmatrix} \begin{bmatrix} P(A) \\ P(\overline{A}) \end{bmatrix} = \begin{bmatrix} P(B) \\ P(\overline{B}) \end{bmatrix}, \qquad \begin{bmatrix} P(A|B) & P(A|\overline{B}) \\ P(\overline{A}|B) & P(\overline{A}|\overline{B}) \end{bmatrix} \begin{bmatrix} P(B) \\ P(\overline{B}) \end{bmatrix} = \begin{bmatrix} P(A) \\ P(\overline{A}) \end{bmatrix}$$

where for convenience we have used matrix/vector representation. Thus,

$$P(B|\overline{A}) = \frac{P(B) - P(B|A)\,P(A)}{P(\overline{A})} = \frac{0.5 - 0.6 \times 0.5}{0.5} = 0.4$$

Hence $P(\overline{A}B) = P(B|\overline{A})P(\overline{A}) = 0.4 \times 0.5 = 0.2 < 1/4$. Because of symmetry $P(\overline{A}\,\overline{B}) = 0.3$ and $P(A\overline{B}) = 0.2$. Thus, the dependence resulted in higher probabilities that the consecutive events are similar and smaller probabilities that they are dissimilar. This corresponds to a general natural behaviour (see also Chapter 3).
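
The arithmetic of this digression can be reproduced with exact fractions. The sketch below only restates the assumptions made above (P(A) = P(B) = 0.5, P(B|A) = 0.6); the variable names are hypothetical.

```python
from fractions import Fraction

p_A = Fraction(1, 2)          # P(A): previous day wet
p_B = Fraction(1, 2)          # P(B): today wet
p_B_given_A = Fraction(3, 5)  # assumed P(B|A) = 0.6, as in the text

p_notA = 1 - p_A
# Total probability, equation (2.21), solved for P(B | not A).
p_B_given_notA = (p_B - p_B_given_A * p_A) / p_notA

# Joint probabilities, keyed by (previous day, today).
joint = {
    ("wet", "wet"): p_B_given_A * p_A,
    ("dry", "wet"): p_B_given_notA * p_notA,
    ("wet", "dry"): (1 - p_B_given_A) * p_A,
    ("dry", "dry"): (1 - p_B_given_notA) * p_notA,
}
for states, p in joint.items():
    print(states, p)
```

Similar consecutive states receive probability 3/10 each and dissimilar ones 1/5 each, in agreement with the values 0.3 and 0.2 found above.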


2.6 Random number generation for stochastic simulation 


One of the important scientific advances offered by stochastics in the last several decades 
is the Monte Carlo method, also known as stochastic simulation. It was originally
developed for the numerical solution of integro-differential equations in Los Alamos in
the framework of the Manhattan Project (Metropolis and Ulam, 1949). It can easily be
shown (e.g. Niederreiter, 1992) that in high dimensional numerical integration (specifically
for a number of dimensions d > 4), a stochastic (Monte Carlo) integration method (in
which the function evaluation points are taken at random) is more accurate (for the same 
total number of evaluation points) than classical numerical integration (based on a grid 
representation of the integration space). This gave importance to the much older concept 
of random numbers, whose first appearance in a scientific publication was Tippett’s 
(1927) table, with 41 600 random digits taken from a 1925 census report. Before that 
(and even after; see Digression 3.F) random sampling was performed by means of dice 
and cards. Thus, Galton (1890) invented a set of three modified dice to generate samples 
from a normal distribution. “Student” (pseudonym of W.S. Gosset) in 1908 performed 
simulation experiments using 3000 cards (in 750 groups of size 4) to find the distribution 
of the t-statistic and of the correlation coefficient (see more information in Stigler 2002). 

With today’s meaning, a sequence of random numbers is a sequence of numbers $x_i$ whose every statistical property is consistent with that of realizations from a sequence of independent identically distributed stochastic variables $\underline{x}_i$ (adapted from Papoulis, 1990). In turn, a random number generator is a device (typically a computer algorithm) which generates a sequence of random numbers $x_i$ with given distribution F(x). Random number generation is also known as Monte Carlo sampling.

The basis of practically all random generators is the uniform distribution in [0,1] (see Table 2.2). A typical procedure is the following:

• We generate a sequence of integers $q_i$ from the recursive algorithm $q_i = (k\,q_{i-1} + c) \bmod m$, where k, c and m are appropriate integers (e.g. k = 69 069, c = 1, m = $2^{32}$ = 4 294 967 296 or k = $7^5$ = 16 807, c = 0, m = $2^{31} - 1$ = 2 147 483 647; Ripley, 1987, p. 39).

• We calculate the sequence of random numbers $u_i$ with uniform distribution in [0,1] by $u_i = q_i/m$.
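
The two steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a generator recommended for practical use; the function name `lcg_uniform` and its argument names are hypothetical, while the default constants are the first set quoted above.

```python
from itertools import islice
from typing import Iterator

def lcg_uniform(seed: int, k: int = 69069, c: int = 1, m: int = 2**32) -> Iterator[float]:
    """Yield u_i = q_i / m, where q_i = (k * q_{i-1} + c) mod m and q_0 = seed."""
    q = seed % m
    while True:
        q = (k * q + c) % m  # step 1: recursive integer sequence
        yield q / m          # step 2: scale to the interval [0, 1)

# Same seed, same sequence: the algorithm is purely deterministic.
print(list(islice(lcg_uniform(seed=12345), 5)))
print(list(islice(lcg_uniform(seed=12345), 5)))

# Second set of constants quoted above; with c = 0 the seed must not be zero.
print(list(islice(lcg_uniform(seed=1, k=7**5, c=0, m=2**31 - 1), 3)))
```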


Obviously, this is a simple algorithm, purely deterministic. Why are the numbers it generates regarded as random? The answer is simple: because if we do not know the algorithm and the initial condition ($q_0$ or $q_{i-1}$) we cannot predict these numbers. As most algorithms, like this one, are purely deterministic, sometimes the numbers are called pseudorandom. But this implies the idea that there exists another category of true or genuine random numbers. Even though in the literature references to true random numbers abound, this may reflect a misunderstanding of the notion of randomness and a dichotomic view of natural processes (cf. Koutsoyiannis, 2010; Dimitriadis et al., 2016). In any process of the macroscopic world, if we were able to know the “algorithm” (the system dynamics), and the initial conditions with full precision, the situation would be the same as with the simple algorithm described. The fact that we are not able to know precisely the algorithm of a physical process and the initial conditions does not make the numbers of a different type.

A more recent and better algorithm for random number generation with uniform distribution is the so-called Mersenne twister, which is available in most computer languages and software packages*.

Once we have a random generator for the uniform distribution, we can make one for any distribution F(x). A direct (but sometimes time demanding) algorithm to produce random numbers $x_i$ from any distribution F(x) is given by:

$$x_i = F^{-1}(u_i) \qquad (2.22)$$

where $u_i$ is the sequence of random numbers with uniform distribution in [0,1].
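
Equation (2.22) can be illustrated with the exponential distribution of Table 2.2, whose inverse is available in closed form, $F^{-1}(u) = -\mu \ln(1 - u)$. The sketch below is an assumption-laden example, not a general-purpose routine: the function names are hypothetical, and the uniform numbers come from Python's standard `random` module, which is based on the Mersenne twister.

```python
import math
import random

def exponential_quantile(u: float, mu: float) -> float:
    """Inverse of F(x) = 1 - exp(-x/mu), i.e. x = -mu * ln(1 - u), for 0 <= u < 1."""
    return -mu * math.log(1.0 - u)

def sample(quantile, n: int, rng: random.Random) -> list:
    """Equation (2.22): apply the quantile function to n uniform random numbers."""
    return [quantile(rng.random()) for _ in range(n)]

rng = random.Random(42)  # fixed seed, so that the sequence is reproducible
xs = sample(lambda u: exponential_quantile(u, mu=2.0), 50_000, rng)
print(sum(xs) / len(xs))  # sample mean, expected to be close to mu = 2
```

For distributions without a closed-form inverse, the equation u = F(x) has to be solved numerically for each number, which is why the direct algorithm can be time demanding.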


2.7 Expectation 


Expectation is a key concept of stochastics, enabling a macroscopic view of a phenomenon 
or process in which the details are intentionally neglected. 


* For example, for Excel (which by default includes the function rand) the Mersenne twister algorithm, called 
NtRand, can be found in www.ntrand.com/download/. 
For a discrete stochastic variable $\underline{x}$, taking on the values $x_1, x_2, \dots, x_J$ (where J could be ∞) with probability mass function $P_j = P(x_j) = P\{\underline{x} = x_j\}$, if g(x) is an arbitrary function of x (so that $g(\underline{x})$ is a stochastic variable per se), we define the expectation or expected value or mean of $g(\underline{x})$ as:

$$\mathrm{E}[g(\underline{x})] := \sum_{j=1}^{J} g(x_j)\,P(x_j) \qquad (2.23)$$

Likewise, for a continuous stochastic variable $\underline{x}$ with density f(x), the expectation is defined as:

$$\mathrm{E}[g(\underline{x})] := \int_{-\infty}^{\infty} g(x)\,f(x)\,\mathrm{d}x \qquad (2.24)$$

Expected values are regular variables: for example, $\mathrm{E}[\underline{x}]$ and $\mathrm{E}[g(\underline{x})]$ are constants, neither functions of $x$ nor of $\underline{x}$. That justifies the notation $\mathrm{E}[\underline{x}]$ instead of $\mathrm{E}(\underline{x})$ or $\mathrm{E}(x)$, which would imply functions of $\underline{x}$ or $x$.
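
For a discrete variable, definition (2.23) is a weighted sum and can be evaluated directly. The sketch below uses a fair die as an assumed example; the function name `expectation` is hypothetical.

```python
from fractions import Fraction

def expectation(g, values, pmf):
    """Equation (2.23): sum of g(x_j) * P(x_j) over all possible values."""
    return sum(g(x) * p for x, p in zip(values, pmf))

faces = range(1, 7)          # values x_j of a fair die
pmf = [Fraction(1, 6)] * 6   # probability mass function P(x_j)

mean = expectation(lambda x: x, faces, pmf)              # 7/2
second = expectation(lambda x: x**2, faces, pmf)         # 91/6
variance = expectation(lambda x: (x - mean)**2, faces, pmf)  # 35/12
print(mean, second, variance)
```

Each result is a single number, not a function of x, which is the point made above about the notation.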


2.8 Moments and cumulants 


For certain types of functions g(x) we get very commonly used statistical parameters, as specified below:

• The noncentral moment of order p (or the pth moment about the origin):

$$g(x) = x^p, \qquad \mu'_p := \mathrm{E}[\underline{x}^p] \qquad (2.25)$$

• The mean (or the first moment):

$$g(x) = x, \qquad \mu := \mu'_1 = \mathrm{E}[\underline{x}] \qquad (2.26)$$

• The central moment of order p:

$$g(x) = (x - \mu)^p, \qquad \mu_p := \mathrm{E}[(\underline{x} - \mu)^p] \qquad (2.27)$$

For p = 0 and 1 the central moments are respectively 1 and 0.

• The variance:

$$g(x) = (x - \mu)^2, \qquad \sigma^2 := \mathrm{E}[(\underline{x} - \mu)^2] = \mu_2 \qquad (2.28)$$

The variance is also denoted as $\mathrm{var}[\underline{x}]$; its square root σ (also denoted as $\mathrm{std}[\underline{x}]$) is called the standard deviation.


Amongst the moments of order higher than two, the most commonly used are the third and fourth. If we standardize them by appropriate powers of σ to make them dimensionless, we get, respectively, the coefficients of skewness and kurtosis:

$$C_s := \frac{\mu_3}{\sigma^3}, \qquad C_k := \frac{\mu_4}{\sigma^4} \qquad (2.29)$$

Another dimensionless index is the ratio:

$$\frac{\mu}{\sigma}, \qquad C_v := \frac{\sigma}{\mu} \qquad (2.30)$$

where the former case is always meaningful, while the latter (the inverse of the former) is meaningful for nonnegative stochastic variables and is called the coefficient of variation.
Central and noncentral moments are related to each other by:

$$\mu'_p = \sum_{i=0}^{p} \binom{p}{i} \mu^{p-i} \mu_i, \qquad \mu_p = \sum_{i=0}^{p} \binom{p}{i} (-\mu)^{p-i} \mu'_i \qquad (2.31)$$

where $\mu'_0 = \mu_0 = 1$, $\mu_1 = 0$, $\mu'_1 = \mu$. Proof of these relationships is given in Appendix 6-II. For small p they take the following forms:

$$\mu'_2 = \sigma^2 + \mu^2, \quad \mu'_3 = \mu_3 + 3\sigma^2\mu + \mu^3, \quad \mu'_4 = \mu_4 + 4\mu_3\mu + 6\sigma^2\mu^2 + \mu^4 \qquad (2.32)$$

and can be inverted as follows:

$$\sigma^2 = \mu'_2 - \mu^2, \quad \mu_3 = \mu'_3 - 3\mu'_2\mu + 2\mu^3, \quad \mu_4 = \mu'_4 - 4\mu'_3\mu + 6\mu'_2\mu^2 - 3\mu^4 \qquad (2.33)$$

For ready reference, Table 2.3 provides the analytical expressions of the moments of some common distribution functions.
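
The relationships (2.31) can be verified exactly on any small discrete distribution. The sketch below uses an arbitrary, assumed probability mass function (the values and probabilities are illustrative only, and the function names are hypothetical) and compares moments computed from their definitions with those obtained through (2.31).

```python
from fractions import Fraction
from math import comb

# Assumed illustrative distribution: values and their probabilities.
values = [0, 1, 2, 3]
pmf = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

def raw_moment(p: int) -> Fraction:
    """Noncentral moment mu'_p, equation (2.25)."""
    return sum(x**p * w for x, w in zip(values, pmf))

mu = raw_moment(1)

def central_moment(p: int) -> Fraction:
    """Central moment mu_p, equation (2.27)."""
    return sum((x - mu)**p * w for x, w in zip(values, pmf))

def raw_from_central(p: int) -> Fraction:
    """First relationship in (2.31)."""
    return sum(comb(p, i) * mu**(p - i) * central_moment(i) for i in range(p + 1))

def central_from_raw(p: int) -> Fraction:
    """Second relationship in (2.31)."""
    return sum(comb(p, i) * (-mu)**(p - i) * raw_moment(i) for i in range(p + 1))

for p in range(5):
    print(p, raw_moment(p), raw_from_central(p), central_moment(p), central_from_raw(p))
```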


Another useful expectation is formed by choosing $g(x) = e^{tx}$ for any t. The logarithm of the resulting expectation is called the cumulant generating function:

$$K(t) := \ln \mathrm{E}[e^{t\underline{x}}] \qquad (2.34)$$

The power series expansion of the cumulant generating function, i.e.:

$$K(t) = \sum_{p=1}^{\infty} \kappa_p \frac{t^p}{p!} \qquad (2.35)$$

defines the cumulants $\kappa_p$. These are related to noncentral moments of similar order by (Smith, 1995):

$$\mu'_p = \sum_{i=0}^{p-1} \binom{p-1}{i} \kappa_{p-i}\,\mu'_i, \qquad \kappa_p = \mu'_p - \sum_{i=1}^{p-1} \binom{p-1}{i} \kappa_{p-i}\,\mu'_i \qquad (2.36)$$

For small p they take the following forms:

$$\kappa_0 = 0, \quad \kappa_1 = \mu'_1 = \mu, \quad \kappa_2 = \sigma^2, \quad \kappa_3 = \mu_3, \quad \kappa_4 = \mu_4 - 3\sigma^4 \qquad (2.37)$$

The importance of cumulants results from their homogeneity and additivity properties. Namely, for a stochastic variable that is the weighted sum of r independent variables $\underline{v}_i$, i.e., $\underline{x} = a_1\underline{v}_1 + \cdots + a_r\underline{v}_r$, the cumulants of $\underline{x}$ are given as

$$\kappa_p = a_1^p\,\kappa_p^{(1)} + \cdots + a_r^p\,\kappa_p^{(r)} \qquad (2.38)$$

where $\kappa_p^{(i)}$ is the pth cumulant of $\underline{v}_i$. This property is quite useful in stochastic simulation (see Chapter 7).
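
The recursion in (2.36) and the additivity property (2.38) can be checked exactly on small discrete distributions. In the sketch below the two distributions and the weights are arbitrary assumptions chosen for illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def raw_moments(dist: dict, pmax: int) -> list:
    """Noncentral moments mu'_0 .. mu'_pmax of a distribution {value: probability}."""
    return [sum(w * x**p for x, w in dist.items()) for p in range(pmax + 1)]

def cumulants(dist: dict, pmax: int) -> list:
    """Cumulants kappa_0 .. kappa_pmax from the recursion in equation (2.36)."""
    m = raw_moments(dist, pmax)
    kappa = [Fraction(0)] * (pmax + 1)
    for p in range(1, pmax + 1):
        kappa[p] = m[p] - sum(comb(p - 1, i) * kappa[p - i] * m[i] for i in range(1, p))
    return kappa

def weighted_sum(d1: dict, a1: int, d2: dict, a2: int) -> dict:
    """Distribution of a1*v1 + a2*v2 for independent v1 and v2."""
    out = {}
    for (x1, w1), (x2, w2) in product(d1.items(), d2.items()):
        x = a1 * x1 + a2 * x2
        out[x] = out.get(x, 0) + w1 * w2
    return out

# Assumed illustrative distributions and weights.
v1 = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
v2 = {0: Fraction(2, 3), 2: Fraction(1, 3)}
a1, a2 = 2, 3

k1, k2 = cumulants(v1, 4), cumulants(v2, 4)
kx = cumulants(weighted_sum(v1, a1, v2, a2), 4)
for p in range(1, 5):
    print(p, kx[p], a1**p * k1[p] + a2**p * k2[p])  # both columns agree, as in (2.38)
```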
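Table 2.3 Moments of some common distributions of continuous variables.

Name, parameters, domain | Probability density or distribution function | Moments

Uniform in [a, b], $a \le x \le b$ | $f(x) = \frac{1}{b-a}$ | $\mu = \frac{a+b}{2}$, $\mu_2 = \frac{(b-a)^2}{12}$, $\mu'_p = \frac{b^{p+1} - a^{p+1}}{(p+1)(b-a)}$
Exponential, $\mu > 0$, $x \ge 0$ | $f(x) = e^{-x/\mu}/\mu$ | $\mu = \mu$, $\mu_2 = \mu^2$, $\mu'_p = p!\,\mu^p$
Gamma, $\zeta > 0$, $\lambda > 0$, $x \ge 0$ | $f(x) = \frac{(x/\lambda)^{\zeta-1} e^{-x/\lambda}}{\lambda\,\Gamma(\zeta)}$ | $\mu = \zeta\lambda$, $\mu_2 = \zeta\lambda^2$, $\mu'_p = \frac{\Gamma(p+\zeta)}{\Gamma(\zeta)}\lambda^p$
Weibull, $\zeta > 0$, $\lambda > 0$, $x \ge 0$ | $F(x) = 1 - \exp\left(-\left(\frac{x}{\lambda}\right)^{\zeta}\right)$ | $\mu = \Gamma\left(1 + \frac{1}{\zeta}\right)\lambda$, $\mu_2 = \left[\Gamma\left(1 + \frac{2}{\zeta}\right) - \Gamma^2\left(1 + \frac{1}{\zeta}\right)\right]\lambda^2$, $\mu'_p = \Gamma\left(1 + \frac{p}{\zeta}\right)\lambda^p$
Normal, $\mu \in \mathbb{R}$, $\sigma > 0$, $x \in \mathbb{R}$ | $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\mu = \mu$, $\mu_2 = \sigma^2$, $\mu_p = 0$ for p odd, $\mu_p = (p-1)!!\,\sigma^p$ for p even
Lognormal ($\ln \underline{x}$ is N($\ln\lambda$, σ)), $\sigma > 0$, $\lambda > 0$, $x \ge 0$ | $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\exp\left(-\frac{1}{2\sigma^2}\left(\ln\frac{x}{\lambda}\right)^2\right)$ | $\mu = e^{\sigma^2/2}\lambda$, $\mu_2 = e^{\sigma^2}\left(e^{\sigma^2} - 1\right)\lambda^2$, $\mu'_p = e^{p^2\sigma^2/2}\lambda^p$
Pareto², $\xi > 0$, $\lambda > 0$, $x \ge 0$ | $F(x) = 1 - \left(1 + \xi\frac{x}{\lambda}\right)^{-1/\xi}$ | $\mu = \frac{\lambda}{1-\xi}$, $\mu_2 = \frac{\lambda^2}{(1-\xi)^2(1-2\xi)}$, $\mu'_p = \frac{p\,\mathrm{B}\left(\frac{1}{\xi} - p,\, p\right)}{\xi^p}\lambda^p$
Pareto-Burr-Feller¹ (PBF)², $\zeta > 0$, $\xi > 0$, $\lambda > 0$, $x \ge 0$ | $F(x) = 1 - \left(1 + \zeta\xi\left(\frac{x}{\lambda}\right)^{\zeta}\right)^{-1/(\zeta\xi)}$ | $\mu'_p = \frac{p\,\mathrm{B}\left(\frac{1}{\zeta\xi} - \frac{p}{\zeta},\, \frac{p}{\zeta}\right)}{\zeta\,(\zeta\xi)^{p/\zeta}}\lambda^p$
Dagum², $\zeta > 0$, $\xi > 0$, $\lambda > 0$, $x \ge 0$ | [expression not recoverable from the source] | [expression not recoverable from the source]
Extreme value type I (EV1), $\lambda > 0$, $x \in \mathbb{R}$ | $F(x) = \exp\left(-e^{-x/\lambda}\right)$ | $\mu = \gamma\lambda$, $\mu_2 = \frac{\pi^2\lambda^2}{6}$, $\mu_3 = 2\zeta(3)\lambda^3$
Extreme value type II (EV2)², $\xi > 0$, $\lambda > 0$, $x \ge 0$ | [expression not recoverable from the source] | [expression not recoverable from the source]

¹ Also known as Pareto III and IV, Burr XII and Feller; for justification of the name PBF see Koutsoyiannis et al. (2018).

² The moments exist (have finite values) only for order $p < 1/\xi$; for larger p they are infinite.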

Digression 2.E: Illustration of the first four moments and related statistical 
characteristics 

The geometrical meaning of the first four moments is visualized in Figure 2.2. Essentially, the first moment, i.e. the mean, describes the location of the centre of gravity of the shape defined by the probability density function and the horizontal axis (Figure 2.2a). It is also equivalent to the static moment of this shape about the vertical axis (given that the area of the shape equals 1). Often, the following quantities are alternatively used as location parameters:

• The mode, or most probable value, $x_p$, is the value of x for which the density f(x) becomes maximum, if the stochastic variable is continuous, or, for discrete variables, the probability mass becomes maximum. If f(x) has one, two or many maxima, we say that the distribution is unimodal, bi-modal or multi-modal, respectively.

• The median, $x_{0.5}$, is the value for which $P\{\underline{x} \le x_{0.5}\} \ge 1/2$ and $P\{\underline{x} \ge x_{0.5}\} \ge 1/2$. Thus, for a continuous stochastic variable, a vertical line at the median separates the graph of the density function into two equivalent parts each having an area of 1/2.

Generally, the mean, the mode and the median are not identical unless the density has a symmetrical and unimodal shape.

The variance of a stochastic variable and its square root, the standard deviation, which has the same dimensions as the stochastic variable, describe a measure of the scatter or dispersion of the probability density around the mean. Thus, a small variance shows a concentrated distribution (Figure 2.2b). The variance cannot be negative; its lowest possible value is zero. This corresponds to a variable that takes one value only (the mean) with absolute certainty. Geometrically, the variance is equivalent to the moment of inertia about the vertical axis passing through the centre of gravity of the shape defined by the probability density function and the horizontal axis.


Figure 2.2 Graphical illustration of the geometrical interpretation of moments of a stochastic variable: (a)
Effect of the mean. Curves (0) and (1) have means 4 and 2, respectively, whereas they both have standard 
deviation 1, coefficient of skewness 1 and coefficient of kurtosis 4.5. (b) Effect of the standard deviation. 
Curves (0) and (1) have standard deviation 1 and 2 respectively, whereas they both have mean 4, coefficient 
of skewness 1 and coefficient of kurtosis 4.5. (c) Effect of the coefficient of skewness. Curves (0), (1) and (2) 
have coefficients of skewness 0, +1.33 and -1.33, respectively, but they all have mean 4 and standard 
deviation 1; their coefficients of kurtosis are 3, 5.67 and 5.67, respectively. (d) Effect of the coefficient of 
kurtosis. Curves (0), (1) and (2) have coefficients of kurtosis 3, 5 and 2, respectively, whereas they all have 
mean 4, standard deviation 1 and coefficient of skewness 0. 


An alternative measure of dispersion is provided by the so-called interquartile range, defined as the difference $x_{0.75} - x_{0.25}$, i.e., the difference of the 0.75 and 0.25 quantiles (or upper and lower quartiles) of the stochastic variable (they define an area in the density function equal to 0.5).

The third central moment is used as a measure of skewness. A zero value indicates that the density is symmetric. This can be easily verified from the definition of the third central moment. If the third central moment is positive or negative, we say that the distribution is positively or negatively skewed respectively (Figure 2.2c). In a positively skewed unimodal distribution, the following inequality holds true: $x_p < x_{0.5} < \mu$; the reverse holds for a negatively skewed distribution.

The fourth central moment is used as a measure of kurtosis, a term which describes the “peakedness” of the probability density function around its mode. A reference value for kurtosis is provided by the normal distribution, which has $C_k = 3$. Distributions with kurtosis greater than the reference value are called leptokurtic (acute, sharp) and have typically fat tails (see below), so that more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations. Distributions with kurtosis less than the reference value are called platykurtic (flat; Figure 2.2d).
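
These characteristics can be computed for the exponential distribution from the moments listed in Table 2.3 ($\mu'_p = p!\,\mu^p$) and the inversion formulas (2.33). The following sketch assumes a mean of 1 purely for illustration; the function name `exponential_shape` is hypothetical, and the mode (zero) and median ($\mu \ln 2$) are standard properties of the exponential distribution rather than results quoted from the text.

```python
import math

def exponential_shape(mu: float) -> dict:
    """Location and shape characteristics of an exponential distribution with mean mu."""
    raw = {p: math.factorial(p) * mu**p for p in range(1, 5)}  # mu'_p = p! mu^p
    var = raw[2] - mu**2                                        # equation (2.33)
    m3 = raw[3] - 3 * raw[2] * mu + 2 * mu**3
    m4 = raw[4] - 4 * raw[3] * mu + 6 * raw[2] * mu**2 - 3 * mu**4
    return {
        "mode": 0.0,                    # the density is maximum at x = 0
        "median": mu * math.log(2.0),   # solves F(x) = 1/2
        "mean": mu,
        "C_s": m3 / var**1.5,           # coefficient of skewness, equation (2.29)
        "C_k": m4 / var**2,             # coefficient of kurtosis, equation (2.29)
    }

shape = exponential_shape(1.0)
print(shape)
```

The skewness is positive, the kurtosis exceeds the reference value 3 (leptokurtic), and the ordering mode < median < mean is the one stated above for positively skewed unimodal distributions.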


2.9 Definition and importance of entropy 


The enumeration of the basic set and hence the definition of a stochastic variable entails arbitrary choices and one could think of different options. In turn, expectations and moments depend on the option chosen. One may think of defining the function $g(\cdot)$ whose expectation is sought, in terms of the probability per se, i.e. $g(x) = h(P(x))$ for a discrete variable or $g(x) = h(f(x))$ for a continuous variable, where $h(\cdot)$ is any specified function. Among the several choices of $h(\cdot)$, most useful is the logarithmic function, which results in the definition of entropy. The emergence of the logarithm in the definition of entropy follows some postulates originally set up by Shannon (1948). Assuming a discrete stochastic variable $\underline{x}$ taking on values $x_j$ with probability mass function $P_j = P(x_j) = P\{\underline{x} = x_j\}$, $j = 1, \dots, J$, which satisfies the obvious relationship:

$$\sum_{j=1}^{J} P_j = 1 \qquad (2.39)$$

the postulates, as reformulated by Jaynes (2003, p. 347), are:

(a) It is possible to set up a numerical measure Φ of the amount of uncertainty which is expressed as a real number.

(b) Φ is a continuous function of $P_j$.

(c) If all the $P_j$ are equal ($P_j = 1/J$) then Φ should be a monotonic increasing function of J.

(d) If there is more than one way of working out the value of Φ, then we should get the same value for every possible way.

Quantification of postulate (d) is given, among others, in Robertson (1993, p. 3) and Uffink (1995; theorem 1), and is related to refinement of partitions to which the probabilities $P_j$ refer.

From these general postulates about uncertainty, a unique (within a multiplicative factor) function Φ results, which serves as the definition of entropy:

$$\Phi[\underline{x}] := \mathrm{E}[-\ln P(\underline{x})] = -\sum_{j=1}^{J} P_j \ln P_j \qquad (2.40)$$

Shannon’s work leading to the above definition was on information theory, but followed the works of Boltzmann and Gibbs in thermodynamics. Additional notes on the historical evolution of the entropy concept are given in Digression 2.F. We note that in classical thermodynamics, entropy is denoted by S (the original symbol used by Clausius), while probability texts use the symbol H. Here Φ was preferred as a unifying symbol for information and thermodynamic entropy, under the interpretation that the two are essentially the same thing* (see Koutsoyiannis, 2013a, 2014a, even though others are of different opinion).
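
Definition (2.40) is straightforward to evaluate. The sketch below is a minimal illustration (the function name `entropy` is hypothetical; terms with zero probability are skipped, following the usual convention that they contribute nothing) and also shows postulate (c) at work for equal probabilities.

```python
import math

def entropy(pmf) -> float:
    """Equation (2.40): minus the sum of P_j ln P_j, skipping zero probabilities."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # equiprobable wet/dry states: ln 2
print(entropy([0.2, 0.8]))   # P{wet} = 0.2 as assumed in Digression 2.B: smaller than ln 2
print(entropy([1.0]))        # a certain outcome: zero uncertainty

# Postulate (c): for P_j = 1/J the entropy equals ln J and increases with J.
for J in (2, 6, 100):
    print(J, entropy([1.0 / J] * J), math.log(J))
```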

Extension of the above definition for the case of a continuous stochastic variable x with 
probability density function f(x), where: 


| f(x)dx =1 (2.41) 


is possible, although not contained in Shannon’s (1948) original work. This extension 
presents some additional difficulties. Specifically, if we discretize the domain of x into 
intervals of size 5x, then (2.40) would give an infinite value for the entropy as 5x tends to 
zero (the quantity —In P = —In(f (x) 6x) will tend to infinity). However, if we involve a 
(so-called) background measure with density B(x) and take the ratio (f(x)5x)/ 
(B(x)5x) = f(x)/B(x), then the logarithm of this ratio will generally converge. This 
allows the definition of entropy for continuous variables as (see e.g. Jaynes, 2003, p. 375, 
Uffink, 1995): 





ae e[-m AC2) ee peed eee (2.42) 
"B (x)} ” BG ) 

The background measure β(x) can be any probability density, proper (with integral equal
to 1, as in (2.41)) or improper (meaning that its integral diverges); typically, it is an
(improper) Lebesgue density, i.e. a constant with dimensions [β(x)] = [f(x)] = [x⁻¹], so
that the argument of the logarithm function be dimensionless. It is easily seen that for
both discrete and continuous variables the entropy Φ[x] is a dimensionless quantity. For
discrete variables it can only take positive values, while for continuous variables it can be
either positive or negative, depending on the assumed β(x). In contrast to the discrete
variables where the entropy for a specified probability mass function is a unique number,
in continuous variables the value of entropy depends on the assumed β(x).
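
The divergence of the discretized entropy, and the role of the background measure in removing it, can be checked numerically. The following Python sketch is an illustration only and is not taken from the source: it assumes an exponential density with unit mean (for which the continuous entropy with β(x) = 1 equals 1), and the helper names `bin_probabilities` and `discrete_entropy` are hypothetical.

```python
import math

def bin_probabilities(dx, upper=50.0):
    """Exact bin probabilities of an exponential density with unit mean
    (an assumed example) on intervals of size dx covering [0, upper]."""
    n = int(round(upper / dx))
    return [math.exp(-j * dx) - math.exp(-(j + 1) * dx) for j in range(n)]

def discrete_entropy(probs):
    """Entropy of a probability mass function, -sum(P ln P), as in (2.40)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

for dx in (0.1, 0.01, 0.001):
    phi = discrete_entropy(bin_probabilities(dx))
    # phi grows like -ln(dx); adding ln(dx) back is the same as dividing
    # each P by beta*dx with beta = 1, and gives a value that settles.
    print(f"dx = {dx:<6} discrete entropy = {phi:.4f}  "
          f"with background measure = {phi + math.log(dx):.4f}")
```

Under these assumptions the first column keeps growing as the interval shrinks, while the second one stabilizes, which is the behaviour the text attributes to the ratio f(x)/β(x).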

The importance of the entropy concept lies in the principle of maximum entropy
(Jaynes, 1957). This postulates that the entropy of a stochastic variable x should be at
a maximum, under some conditions, formulated as constraints, which incorporate the
information that is given about this variable. This principle can be used for logical
inference as well as for modelling physical systems; for example, the tendency of entropy
to become maximal (Second Law of thermodynamics), a tendency that is the driving force
of natural change, can result from this principle.

* One of the reasons for this preference is historical: for a long time, entropy used to be denoted by Φ (Perry,
1903; Swinburne, 1904; Ewing, 1920), and this is still echoed in the term tephigram (T-Φ-gram) used in
meteorology.


Digression 2.F: The meaning of entropy 


Entropy is etymologized from the ancient Greek ἐντροπία (from the verb ἐντρέπειν, to turn into,
to turn about) but was introduced as a scientific term by Rudolf Clausius only in 1865, although
the concept appears also in his earlier works (as described in Clausius, 1872). The rationale for
introducing the term is explained in his own words (Clausius, 1867, p. 358, which indicates that
he was not aware of the existence of the word ἐντροπία in ancient Greek):


We might call S the transformational content of the body [...]. But as I hold it to be better to 
borrow terms for important magnitudes from the ancient languages, so that they may be adopted 
unchanged in all modern languages, I propose to call the magnitude S the entropy of the body, 
from the Greek word τροπή, transformation. I have intentionally formed the word entropy so as
to be as similar as possible to the word energy; for the two magnitudes to be denoted by these
words are so nearly allied in their physical meanings, that a certain similarity in designation
appears to be desirable.


In addition to its semantic content, this quotation contains a very important insight: the 
recognition that entropy is related to transformation and change and the contrast between 
entropy and energy, where the latter is a quantity that is conserved in all changes. This meaning 
has been more clearly expressed in Clausius’ famous aphorism (Clausius, 1865): 


Die Energie der Welt ist konstant. 

Die Entropie der Welt strebt einem Maximum zu. 
(The energy of the world is constant. 

The entropy of the world strives to a maximum). 


In other words, entropy and its ability to increase (as contrasted to energy and other quantities 
that are conserved) is the driving force of change. This property of entropy has seldom been 
acknowledged; instead in common perception entropy is typically identified with disorganization 
and deterioration as if change can only have negative consequences. 

Mathematically, the thermodynamic entropy, S, is defined in the same Clausius’ texts through 
the equation dS = dQ/T, where Q and T denote heat and temperature. The definition, however, 
applies to a reversible process only. The fact that in an irreversible process dS > dQ/T makes 
the definition imperfect and affected by circular reasoning, as, in turn, a reversible process is one 
in which the equation holds. 

Two decades later (in 1877) Ludwig Boltzmann (1877; see also Swendsen, 2006) gave entropy 
a statistical content as he related it to probabilities of statistical mechanical system states, thus
explaining the Second Law of thermodynamics as the tendency of the system to run toward more 
probable states, which have higher entropy. The statistical concept of entropy was advanced later 
in the works of Gibbs in thermodynamics and von Neumann in quantum mechanics. 

Shannon (1948) used an essentially similar, albeit more general, entropy definition to describe 
the information content, which he also called entropy at von Neumann’s suggestion (Robertson, 
1993; Koutsoyiannis, 2011b). According to the latter definition, entropy is a probabilistic concept, 
a measure of information or, equivalently, uncertainty. A decade later, Jaynes (1957) introduced 
the principle of maximum entropy thus equipping the entropy concept with a powerful tool for 
logical inference. 

More than half a century later, the meaning of entropy is still debated and a diversity of opinion 
among experts is encountered (Swendsen, 2011). In particular, despite having the same name, 
probabilistic (or information) entropy and thermodynamic entropy are still regarded by many as
two distinct notions having in common only the name. The classical definition of thermodynamic
entropy (as above) does not give any hint about similarity with the probabilistic entropy. The fact 
that the latter is a dimensionless quantity and the former has units (J/K) has been regarded as an 
argument that the two are dissimilar. Even Jaynes (2003), the founder of the maximum entropy 
principle, states: 


We must warn at the outset that the major occupational disease of this field is a persistent failure 
to distinguish between the information entropy, which is a property of any probability 
distribution, and the experimental entropy of thermodynamics, which is instead a property of a 
thermodynamic state as defined, for example by such observed quantities as pressure, volume, 
temperature, magnetization, of some physical system. They should never have been called by the 
same name; the experimental entropy makes no reference to any probability distribution, and the 
information entropy makes no reference to thermodynamics. Many textbooks and research 
papers are flawed fatally by the author's failure to distinguish between these entirely different 
things, and in consequence proving nonsense theorems. 


However, the units of thermodynamic entropy are only an historical accident, related to the 
arbitrary introduction of temperature scales (Atkins, 2007). In a recent book, Ben-Naim (2008) 
has attempted to replace altogether the concept of entropy with the concept of information. 
However, such a replacement is unnecessary or even meaningless if we accept that the two 
concepts are identical. As has recently been shown (Koutsoyiannis, 2013a, 2014a), the 
thermodynamic entropy of gases can be easily produced by the formal probability theory without 
the need of strange assumptions (e.g. indistinguishability of particles). The logical basis of the 
latter study includes the following points. 


• The classical definition of thermodynamic entropy is not necessary; it can be abandoned and
replaced by the probabilistic definition.

• The thus defined entropy is the fundamental thermodynamic quantity, which supports the
definition of all other derived ones. For example, the temperature is defined as the inverse of
the partial derivative of entropy with respect to the internal energy (see Digression 10.D).

• The entropy retains its dimensionless character even in thermodynamics, thus rendering the
unit of kelvin an energy unit.

• The entropy retains its probabilistic interpretation as a measure of uncertainty, leaving aside
the traditional but obscure ‘disorder’ interpretation.

• The tendency of entropy to reach a maximum is the driving force of natural change. This
tendency is formalized as the principle of maximum entropy, which can be regarded both as
a physical (ontological) principle obeyed by natural systems, as well as a logical
(epistemological) principle applicable in making inference about natural systems.


Examples of deductive reasoning in deriving thermodynamic laws from the formal 
probabilistic principle of maximum entropy have been provided in Koutsoyiannis (2014a). 
Notable among them is the derivation of the law of phase transition of water (Clausius-Clapeyron 
equation) by maximizing entropy, i.e. uncertainty, at the microscopic level, yet leading to an 
expression that is virtually certain at a macroscopic level (see Digression 10.D). 


Digression 2.G: Illustration of the principle of maximum entropy 


Here we illustrate the maximum entropy (ME) principle in a few simple cases. The examples may 
look trivial. However, we must have in mind that, as already mentioned in Digression 2.F, we can 
infer with the same reasoning more interesting things, such as the saturation vapour pressure in 
the atmosphere (Digression 10.D). The logic is the same: we maximize the uncertainty with
respect to the state of a die or a water molecule.


(a) We thus start from the simple example of determining the probabilities of the outcomes of a 
die throw. For the die the entropy is: 


$$\Phi = \mathrm{E}[-\ln P(x)] = -P_1 \ln P_1 - P_2 \ln P_2 - P_3 \ln P_3 - P_4 \ln P_4 - P_5 \ln P_5 - P_6 \ln P_6$$

Considering also the equality constraint:

$$P_1 + P_2 + P_3 + P_4 + P_5 + P_6 = 1$$

we form the objective function to maximize as:

$$A := -P_1 \ln P_1 - P_2 \ln P_2 - P_3 \ln P_3 - P_4 \ln P_4 - P_5 \ln P_5 - P_6 \ln P_6 - a\,(P_1 + P_2 + P_3 + P_4 + P_5 + P_6 - 1)$$

where a is a Lagrange multiplier. We find the partial derivatives with respect to each of the
variables and equate them to zero, obtaining:

$$\frac{\partial A}{\partial P_1} = -1 - \ln P_1 - a = 0, \quad \dots, \quad \frac{\partial A}{\partial P_6} = -1 - \ln P_6 - a = 0$$

Obviously, the solution of these equations yields the single maximum:

$$P_1 = P_2 = P_3 = P_4 = P_5 = P_6 = 1/6$$

The entropy is Φ = −6 (1/6) ln(1/6) = ln 6. In general, the entropy for J equiprobable outcomes
is:

$$\Phi = \ln J \qquad (2.43)$$


It is noted that entropy and information are complementary to each other. When we know
(observe) that the outcome is i (Pᵢ = 1, Pⱼ = 0 for j ≠ i), the entropy is zero.

In the above case of a fair die throw, the application of the ME principle is equivalent to the 
principle of insufficient reason (attributed to Bernoulli and Laplace). However, while the former is 
a variational law (equivalent to the solution of an optimization problem), the latter is formulated 
in terms of equations. A single variational law is always much more powerful than very many 
equations. Actually, from a variational law we derive as many equations as there are unknowns 
(even an infinite number of equations). And as we showed, in this case the principle of insufficient 
reason results from the variational ME principle, and thus there is no need at all to postulate the 
former as an additional philosophical or scientific principle. 


(b) To illustrate that the variational ME principle is more powerful, we consider the following 
variant of the problem in which uniformity is a priori excluded. Specifically, we assume that the 
die is loaded and that we have prior information that P₆ = 2P₁. What is the probability that the
outcome of a die throw will be i in this case? For the ME optimization we only need to take into
account the additional constraint, by adding to the objective function the term b(P₆ − 2P₁) where
b is an additional Lagrange multiplier. The solution of the optimization problem is a single
maximum, P₂ = P₃ = P₄ = P₅ = 0.1698 (slightly >1/6), P₁ = 0.1069, P₆ = 0.2139. The entropy is
Φ = 1.7732, smaller than in the case of equiprobability, where Φ = ln 6 = 1.792. The decrease of
entropy in the loaded die derives from the additional information incorporated in the constraints. 
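
The loaded-die solution can be reproduced with a few lines of Python. This is a minimal sketch rather than the author's procedure: it assumes that the Lagrange conditions give probabilities proportional to exp(−b gᵢ), where gᵢ are the coefficients of the linear constraint, and finds the multiplier b by bisection. The function names `maxent_probs` and `entropy` are hypothetical, and the sign convention of b is immaterial to the result.

```python
import math

def maxent_probs(g, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy probabilities subject to sum(P) = 1 and
    sum(g_i * P_i) = 0, using P_i proportional to exp(-b * g_i)."""
    def probs(b):
        w = [math.exp(-b * gi) for gi in g]
        total = sum(w)
        return [wi / total for wi in w]

    def residual(b):
        return sum(gi * pi for gi, pi in zip(g, probs(b)))

    # residual(b) is non-increasing in b (its derivative is minus a
    # variance), so plain bisection locates the root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return probs(0.5 * (lo + hi))

def entropy(p):
    """Entropy -sum(P ln P) of a probability mass function."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Constraint P6 - 2*P1 = 0 written as sum(g_i * P_i) = 0
g_loaded = [-2.0, 0.0, 0.0, 0.0, 0.0, 1.0]
p_loaded = maxent_probs(g_loaded)
print([round(p, 4) for p in p_loaded], round(entropy(p_loaded), 4))
```

With all coefficients gᵢ set to zero the same routine returns the uniform probabilities of case (a), so the two cases differ only in the constraint supplied.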


(c) In another example we consider a roulette wheel which is not divided into pockets but its 
outcome is a real number measured on a circular scale graded 0 to J. In this case our stochastic 
variable x is of continuous type. Assuming background measure β(x) = 1, the entropy is

$$\Phi[x] = -\int_0^J \ln f(x)\, f(x)\,dx$$

Considering also the constraint (2.41) with a Lagrange multiplier a, we should maximize:

$$A := -\int_0^J \ln f(x)\, f(x)\,dx - a\left(\int_0^J f(x)\,dx - 1\right)$$

Finding the partial derivative with respect to f and equating it to zero we obtain:

$$\frac{\partial A}{\partial f} = -1 - \ln f - a = 0$$

Hence f = exp(−1 − a) = constant and from the constraint we obtain that the entropy
maximizing density is:

$$f(x) = 1/J \qquad (2.44)$$

and the entropy is:

$$\Phi = \ln J \qquad (2.45)$$


This is the uniform distribution, given in Table 2.2. Notice that the expression of maximum 
entropy for a discrete stochastic variable (equation (2.43)) is identical to that of a continuous 
stochastic variable (equation (2.45)). 


(d) If in the uniform distribution the upper bound J tends to ∞, it becomes improper (f(x) = 0).
Therefore, in this case we need an additional constraint to find a proper distribution. The simplest
one that we can think of is that the distribution has a specified mean μ, i.e.:

$$\int_0^{\infty} x f(x)\,dx = \mu$$

The expression of the entropy is the same as in the example (c), but the objective function to
maximize becomes:

$$A := -\int_0^{\infty} \ln f(x)\, f(x)\,dx - a\left(\int_0^{\infty} f(x)\,dx - 1\right) - b\left(\int_0^{\infty} x f(x)\,dx - \mu\right)$$

Thus,

$$\frac{\partial A}{\partial f} = -1 - \ln f - a - bx = 0$$

and

$$f(x) = B \exp(-bx)$$

where from the two constraints we find, after the algebraic operations, that B = b = 1/μ. This is
the exponential distribution given in Table 2.2. It is very common in physics, as the mean
constraint, from which it results, is omnipresent. For example, if x represents the kinetic energy
of one of many particles moving in a box, we do not know the exact energy of each particle (which
may change due to collisions, assumed to be elastic) but we may know the average μ, which is
preserved according to the related physical principle. Consequently, the distribution of the kinetic 
energy is exponential. 
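
As a numerical sanity check of case (d), the sketch below compares the entropy (with β(x) = 1) of three densities on [0, ∞) that share the same mean. It is an illustration under stated assumptions, not part of the source: the common mean μ = 1, the two comparison densities (half-normal and uniform on [0, 2μ]) and the midpoint-rule integrator `entropy` are all choices made here for the example.

```python
import math

def entropy(f, upper, n=100_000):
    """Midpoint-rule estimate of -integral of ln(f) f over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(n):
        fx = f((i + 0.5) * h)
        if fx > 0.0:
            total -= fx * math.log(fx) * h
    return total

mu = 1.0  # common mean of the candidate densities (assumed value)

def exponential(x):
    return math.exp(-x / mu) / mu

sigma = mu * math.sqrt(math.pi / 2.0)  # half-normal scale that gives mean mu

def half_normal(x):
    return math.sqrt(2.0 / math.pi) / sigma * math.exp(-0.5 * (x / sigma) ** 2)

def uniform(x):
    return 1.0 / (2.0 * mu) if x <= 2.0 * mu else 0.0

phi = {
    "exponential": entropy(exponential, 40.0),
    "half-normal": entropy(half_normal, 40.0),
    "uniform": entropy(uniform, 2.0 * mu),
}
for name, value in phi.items():
    print(f"{name:12s} entropy = {value:.4f}")
```

Among these three candidates the exponential density attains the largest entropy, 1 + ln μ, as the derivation predicts; a comparison with a handful of densities is of course only a spot check, not a proof.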


(e) If in the above example of moving particles, we limit the motion on a straight line and we 
choose as stochastic variable x not the kinetic energy but the velocity, which can be either positive 
or negative, the kinetic energy constraint is written as 


$$\int_{-\infty}^{\infty} x^2 f(x)\,dx = \gamma$$

where γ is twice the average kinetic energy per unit mass. The objective function to maximize
becomes:

$$A := -\int_{-\infty}^{\infty} \ln f(x)\, f(x)\,dx - a\left(\int_{-\infty}^{\infty} f(x)\,dx - 1\right) - b\left(\int_{-\infty}^{\infty} x^2 f(x)\,dx - \gamma\right)$$

Thus,

$$\frac{\partial A}{\partial f} = -1 - \ln f - a - bx^2 = 0$$

and

$$f(x) = B \exp(-bx^2)$$

where from the two constraints we find, after the algebraic operations, that B = 1/√(2πγ), b = 1/(2γ).
This is the normal distribution given in Table 2.2, with μ = 0 and standard deviation σ = √γ.
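
The algebraic operations of case (e) can be spelled out as follows. This is the standard Gaussian-integral computation, added here for convenience and not quoted from the source; it assumes the two constraints are taken over the whole real line.

```latex
\begin{aligned}
\int_{-\infty}^{\infty} B\,e^{-b x^2}\,dx &= B\sqrt{\pi/b} = 1, \\
\int_{-\infty}^{\infty} x^2\,B\,e^{-b x^2}\,dx &= B\sqrt{\pi/b}\;\frac{1}{2b} = \frac{1}{2b} = \gamma, \\
\Rightarrow\quad b = \frac{1}{2\gamma}, \qquad B &= \sqrt{b/\pi} = \frac{1}{\sqrt{2\pi\gamma}} .
\end{aligned}
```

Substituting back gives f(x) = exp(−x²/(2γ))/√(2πγ), i.e. a normal density with zero mean and variance γ.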

In fact, all distributions of Table 2.2 turn out to be entropy maximizing distributions, either 
without a constraint or under a simple constraint in each case. The results are summarized in 
Table 2.4. 


Table 2.4 Entropy of the simplest and most common distributions of Table 2.2, which turn out to 
be entropy maximizing distributions either without a constraint or under one or two constraints. 





Each entry gives the name (and parameters), the probability mass or density function and the
distribution function, and the corresponding entropy for unit background measure.

Discrete variable x with values xⱼ

Discrete uniform (xⱼ = 1, …, J)
    P(xⱼ) = 1/J
    F(x) = max(0, min(⌊x⌋/J, 1))
    Φ[x] = ln J (the maximum among all distributions with xⱼ = 1, …, J)

Geometric (μ > 0; xⱼ = 0, 1, …)
    P(xⱼ) = (1/(1 + μ)) (μ/(1 + μ))^xⱼ
    F(x) = max(0, 1 − (μ/(1 + μ))^(⌊x⌋ + 1))
    Φ[x] = ln((1 + μ)^(1 + μ)/μ^μ) ≈ 1 + ln(μ + 1/e) (the maximum among all distributions with
    xⱼ = 0, 1, …, and mean μ)

Continuous variable x

Uniform (J > 0)
    f(x) = 1/J for 0 ≤ x ≤ J, and 0 otherwise
    F(x) = max(0, min(x/J, 1))
    Φ[x] = ln J (the maximum among all distributions with domain [0, J])

Exponential (μ > 0)
    f(x) = exp(−x/μ)/μ for x ≥ 0, and 0 for x < 0
    F(x) = 1 − exp(−x/μ) for x ≥ 0, and 0 for x < 0
    Φ[x] = 1 + ln μ (the maximum among all distributions with domain [0, ∞) and mean μ)

Normal (μ ∈ ℝ, σ > 0)
    f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
    F(x) = ∫ from −∞ to x of f(u) du
    Φ[x] = (1/2)(1 + ln(2π)) + ln σ = 1.419 + ln σ (the maximum among all distributions with
    domain (−∞, ∞), mean μ and standard deviation σ)

Note: ⌊x⌋ denotes the floor of the number x.


2.10 Maximum entropy distributions 


In Digression 2.G we illustrated several simple cases of entropy maximization, in which 
we determined the entire probability mass or density function based on one or two 
constraints. We can generalize the result for a number of constraints of the form:

$$\mathrm{E}[g_i(x)] = \eta_i \;\Leftrightarrow\; \int_{-\infty}^{\infty} g_i(x) f(x)\,dx - \eta_i = 0 \qquad (2.46)$$

and for any background measure β. In this case, after incorporating the constraints to the
entropy with Lagrange multipliers, the expression whose maximization is sought is:

$$A := -\int_{-\infty}^{\infty} \ln\frac{f(x)}{\beta(x)}\, f(x)\,dx - a\left(\int_{-\infty}^{\infty} f(x)\,dx - 1\right) - \sum_i b_i\left(\int_{-\infty}^{\infty} g_i(x) f(x)\,dx - \eta_i\right) \qquad (2.47)$$

Taking the partial derivative with respect to f and equating it to zero we find

$$-1 - \ln\frac{f(x)}{\beta(x)} - a - \sum_i b_i g_i(x) = 0 \qquad (2.48)$$

and thus, the entropy maximizing density is:

$$f(x) = A\,\beta(x) \exp\left(-\sum_i b_i g_i(x)\right) \qquad (2.49)$$


As we have seen in Digression 2.G, some of the most typical distributions which are 
used in a variety of scientific fields can be derived by entropy maximization using a simple 
constraint. Here we will try to get a plethora of distributions again using a single 
constraint but both with a Lebesgue background measure and a generalized one. 

The background measure β(x) determines the way of measuring the distances d
between values of x; the Lebesgue measure corresponds to the Euclidean distance,
d(x, x′) = |x − x′|. However, most hydrometeorological variables are non-negative
physical quantities unbounded from above (e.g. precipitation, streamflow, and
absolute temperature expressed in kelvins). In positive physical quantities, often the
Euclidean distance is not a proper metric; sometimes we use a logarithmic distance
d(x, x′) = |ln(x′/x)|, as shown in the example below referring to precipitation depth:


Pair of points                  Euclidean distance    Logarithmic distance

x = 0.1 mm, x′ = 0.2 mm         0.1 mm                ln 2
x = 100 mm, x′ = 100.1 mm       0.1 mm                ln 1.001
x = 100 mm, x′ = 200 mm         100 mm                ln 2


Which of the second and third pairs of points is equidistant to the first one? In an attempt 
to merge (or unify) the Euclidean and logarithmic distance, we heuristically introduce 
(see Koutsoyiannis, 2014a) a background measure for nonnegative variables that is based 
on the hyperbola:

$$\beta(x) = \frac{1}{\lambda + x} \qquad (2.50)$$

where λ is a characteristic scale parameter, which also serves as a physical unit for x. We
will refer to it as the hyperbolic background measure and we note that for λ → ∞, it tends
to the Lebesgue measure. According to this measure, the distance of any point x from 0 is:

$$B(x) := \int_0^x \lambda\,\beta(s)\,ds = \lambda \ln\left(1 + \frac{x}{\lambda}\right) \qquad (2.51)$$
An example plot for B(x) is given in Figure 2.3. Its limiting properties are: 


$$\lim_{\lambda\to\infty} B(x) = x, \qquad \lim_{\lambda\to 0}\left(\frac{B(x)}{\lambda} + \ln\lambda\right) = \ln x \qquad (2.52)$$


The distance between any two points x and x’ is: 


$$d(x, x') = |B(x') - B(x)| = \lambda \left|\ln\frac{1 + x'/\lambda}{1 + x/\lambda}\right| \qquad (2.53)$$





For small x values, i.e. x < x′ ≪ λ, the distance is d(x, x′) = λ ln(1 + (x′ − x)/(λ + x)) ≈
x′ − x (Euclidean distance). For large values, λ ≪ x < x′, d(x, x′) ≈ λ ln(x′/x)
(logarithmic distance). We notice that both B(x) and d(x, x′) have the same units as x
(physical consistency).

Figure 2.3 Illustration of the distance function B(x); the example plot of y = B(x) is for λ = 10
and shows that for small x (< λ/10) B(x) is indistinguishable from x, while for large x (> 10λ)
B(x) becomes a linear function of ln x.
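
The two regimes of the hyperbolic distance can be seen by evaluating (2.51) and (2.53) directly. The Python sketch below is illustrative only: the value λ = 10 mm mirrors the example plot, the pairs of points extend the precipitation example given earlier, and the function names `B` and `distance` are chosen here for convenience.

```python
import math

def B(x, lam):
    """Distance of x from 0 under the hyperbolic background measure, eq. (2.51)."""
    return lam * math.log1p(x / lam)

def distance(x1, x2, lam):
    """Distance between two nonnegative points, eq. (2.53)."""
    return abs(B(x2, lam) - B(x1, lam))

lam = 10.0  # scale parameter in mm (an arbitrary choice for illustration)
pairs = [(0.1, 0.2), (100.0, 100.1), (100.0, 200.0), (1000.0, 2000.0)]
for x1, x2 in pairs:
    d = distance(x1, x2, lam)
    print(f"x = {x1:7.1f}, x' = {x2:7.1f}: d = {d:8.4f}  "
          f"(Euclidean {x2 - x1:.1f}, lam*ln(x'/x) = {lam * math.log(x2 / x1):.4f})")
```

For points well below λ the computed distance is close to the Euclidean one, and for points well above λ it approaches λ ln(x′/x), in line with the limiting behaviour stated above.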

In the general solution (2.49) we use a single constraint for g(x) = B(x), that is
E[B(x)] = η, where we have assumed dimensions [B(x)] = [x] = [λ]. We note that
β(x) = B′(x)/λ, where the derivative B′(x) is dimensionless. Thus from (2.49) we get:

$$f(x) = A\,\beta(x)\exp(-b_1 B(x)) = \frac{A}{\lambda}\exp\left(-b\,\frac{B(x)}{\lambda} + \ln(B'(x))\right) \qquad (2.54)$$

where b = b₁λ. We may notice that all quantities in the big parenthesis are dimensionless.
Now we make the following generalizations by raising the following quantities to powers:

$$(x/\lambda) \to (x/\lambda)^c, \qquad B(x)/\lambda \to (B(x)/\lambda)^d, \qquad B'(x) \to (B'(x))^e \qquad (2.55)$$

and get

$$f(x) = \frac{A}{\lambda}\exp\left(-b\left(\frac{B(x)}{\lambda}\right)^d + e\ln(B'(x))\right) \qquad (2.56)$$


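where

$$B(x) = \lambda\ln\left(1 + \left(\frac{x}{\lambda}\right)^c\right), \qquad B'(x) = c\left(\frac{x}{\lambda}\right)^{c-1}\left(1 + \left(\frac{x}{\lambda}\right)^c\right)^{-1} \qquad (2.57)$$

After the algebraic operations we find the generalized maximum entropy distribution:

$$f(x) = \frac{A'}{\lambda}\left(\frac{(x/\lambda)^{c-1}}{1 + (x/\lambda)^c}\right)^e \exp\left(-b\left(\ln\left(1 + \left(\frac{x}{\lambda}\right)^c\right)\right)^d\right) \qquad (2.58)$$

where A′ := A cᵉ.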
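As a special case, when λ → ∞, the hyperbolic background measure approaches the
Lebesgue measure and the quantities in (2.57) become:

$$B(x) = \lambda\left(\frac{x}{\lambda}\right)^c, \qquad B'(x) = c\left(\frac{x}{\lambda}\right)^{c-1} \qquad (2.59)$$

Hence, the density of (2.56) becomes

$$f(x) = \frac{A}{\lambda}\exp\left(-b\left(\frac{x}{\lambda}\right)^{cd} + e\ln\left(c\left(\frac{x}{\lambda}\right)^{c-1}\right)\right) \qquad (2.60)$$

or

$$f(x) = \frac{A'}{\lambda}\left(\frac{x}{\lambda}\right)^{(c-1)e}\exp\left(-b\left(\frac{x}{\lambda}\right)^{cd}\right) \qquad (2.61)$$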

The densities (2.61) and (2.58) contain as special cases most common distributions
used in stochastics, including hydroclimatic stochastics. These special cases are listed in
Table 2.5 in terms of their densities f(x) and tail functions F̄(x) := 1 − F(x). In particular,
the density (2.61), which is derived from the Lebesgue background measure, corresponds
to a generalized gamma distribution also listed in Table 2.5, after suitable transformation
of its parameters. The density (2.58), which is derived from the hyperbolic background
measure, does not yield a closed expression for F̄(x) in its general case, and therefore is
not listed in Table 2.5. In this case, a sufficiently general form with a closed expression of
F̄(x) is derived if we set d = 1; this is listed as the generalized (power transformed) beta
prime distribution (where the standard beta prime corresponds to c = 1). The
generalized gamma and generalized beta prime distributions were also studied in
Koutsoyiannis (2005a,c, where additional information for some of their characteristics
is provided) and Papalexiou and Koutsoyiannis (2012).
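
How the special cases arise from the two general densities can be checked by direct substitution. The following Python sketch evaluates the unnormalized kernels of (2.61) and (2.58), dropping the constant A′/λ, and compares them with the familiar exponential, Weibull and power-law (Pareto-type) kernels. The function names and the numerical values of x, λ, b, c and e are arbitrary choices made for illustration, not values from the source.

```python
import math

def kernel_lebesgue(x, lam, b, c, d, e):
    """Unnormalized density of (2.61): (x/lam)^((c-1)e) * exp(-b (x/lam)^(cd))."""
    u = x / lam
    return u ** ((c - 1.0) * e) * math.exp(-b * u ** (c * d))

def kernel_hyperbolic(x, lam, b, c, d, e):
    """Unnormalized density of (2.58)."""
    u = x / lam
    return (u ** (c - 1.0) / (1.0 + u ** c)) ** e * math.exp(-b * math.log1p(u ** c) ** d)

x, lam = 3.0, 2.0

# b = c = d = 1 in (2.61): exponential kernel exp(-x/lam)
print(kernel_lebesgue(x, lam, 1.0, 1.0, 1.0, 0.7), math.exp(-x / lam))

# b = d = e = 1 in (2.61): Weibull kernel with shape parameter c
c = 1.8
print(kernel_lebesgue(x, lam, 1.0, c, 1.0, 1.0),
      (x / lam) ** (c - 1.0) * math.exp(-((x / lam) ** c)))

# c = d = 1 in (2.58): power-law kernel (1 + x/lam)^-(b + e)
b, e = 1.5, 2.0
print(kernel_hyperbolic(x, lam, b, 1.0, 1.0, e), (1.0 + x / lam) ** (-(b + e)))
```

Each printed pair should agree to rounding error. The last case decays as a power of x, so its tail function decays with exponent b + e − 1, consistent with the tail index 1/(b + e − 1) that appears in the parameter column of Table 2.5.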

The distributions and the special cases resulting from equations (2.61) and (2.58)
correspond to nonnegative stochastic variables, x ≥ 0. However, in some of the cases, in
which the variable x appears in Table 2.5 raised to power 2, the extension to the whole
real line is direct. The distributions of this type are earmarked as “half” in the table, and
their “full” versions (valid for all real numbers) are derived by dividing the expressions
given in the table by 2; this case includes the normal and Student distributions.

Table 2.5 Special cases of maximum entropy distributions given by equations (2.61) and (2.58).





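Each entry gives the name, the parameters, the density f(x) and the tail function F̄(x) = 1 − F(x).

Lebesgue background measure

Exponential
    Parameters: b = c = d = 1
    f(x) = (1/λ) exp(−x/λ)
    F̄(x) = exp(−x/λ)

Gamma¹
    Parameters: d = 1/c, b = 1, ζ = (c − 1)e + 1
    f(x) = (1/(λ Γ(ζ))) (x/λ)^(ζ − 1) exp(−x/λ)
    F̄(x) = Γ_{x/λ}(ζ)/Γ(ζ) (with the upper incomplete gamma function in the numerator)

Weibull²
    Parameters: b = d = e = 1, ζ = c
    f(x) = (ζ/λ) (x/λ)^(ζ − 1) exp(−(x/λ)^ζ)
    F̄(x) = exp(−(x/λ)^ζ)

Half-normal
    Parameters: c = 1, d = 2, b = 1/2
    f(x) = (2/(√(2π) λ)) exp(−(1/2)(x/λ)²)
    F̄(x) = 1 − erf(x/(√2 λ))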
Extended b=d=1,c=2, 1 x2 o-5 x2 hd - (=) 
half‘ normal? @ = (e+1)/2 ITO (5) ) exp (- (5) ) T(Z)’ 2G 
Generalized ¢ =cd,b=1, 1 ge : @) LE 4) _ ey 
gamma+* C= ((c —let 1)/cd AT(Z') \A P A Mey? = OM 
Hyperbolic background measure 
1 1 
b=c=d=1,é=1/e 1 ce X\-F 
Pareto®> “ays 
c=d=e=1,f=1/b eG =) (2) 
1 1 
Pareto-Burr- a a 1 -x\o-1 yet 2cE + x\o\ FE 
pelle PEE) See set) (1 + (3) ) (: +G) ) 
Half 1 x\)\? 
lognormal ; = 1/2 Ld =2, 2D (- 2. (in (1 i %)) ) 1 —erf (Fm (1 + -)) 
Vana 1+x/a v2 
r,(1/d) 
xy$ ee Bee 
Generalized b=e=1 td exp (- (in (1 + (3) )) ) T(1/d)’ 





legeammial Ce raya (ay CG) z=(m(1+@')) 











11 
Half Student® c=2,d=1,e=0 2 xy? 2 zE Bye (5-3) x2 
§=1/(b+e-1) ey aD ced (x) 
c=2,d= x\2\ 2072 2)" 20-2 B (Gre) 
Half extended PEA ten 2((3) ) (1+ *) ) ae 20’ 2é oe ey 
Student €=1/(b+e-1) 4B (G7ige) B(x¢-3¢) ‘ A 
Generalized d=1,¢=c x wt xs “Ere % rated) 
betaprime = ¢’ = (1/(c — 1)e + 1), oa eG) fry WO) 6o'$5 = y=() 
(GBP)° E=1/(b+e-1) 1B(az) ce a a 





Note: Distributions named “half” have their “full” version whose density f(x) and exceedance F̄(x) are obtained by dividing
those given in the table by 2. The “half” version given here corresponds to x ≥ 0, while the “full” version is supported on the
whole real line, except for the full lognormal distribution in which x > −λ.

1 Special cases: Chi-squared and Erlang.

2 Special case: Rayleigh. Antisymmetric case (in which F̄(x) ↔ F(x)): Fréchet.

3 Also known as Chi.

4 Special cases: Maxwell-Boltzmann, Maxwell-Jüttner, Nakagami. Antisymmetric cases: Inverse-chi-squared, Inverse-gamma,
Lévy.

5 More precisely, Pareto II or Lomax.

6 Also known as Pareto III and IV, Burr XII and Feller. Antisymmetric case: Log-logistic.

7 For d = 1 becomes PBF with tail index ξ = 1/c. For d > 1, ξ = 0 (all moments exist). For d < 1, ξ = ∞ (no moment exists).

8 Also known as Tsallis or 1-particle kappa distribution (Olbert, 1968; Livadiotis and McComas 2013). Special case: Cauchy.

9 Special cases: Beta prime, F. Antisymmetric case: Dagum, often referred to in hydrology as the kappa distribution (Mielke,
1973; Mielke and Johnson, 1973; Hosking, 1994) but it is totally different from the kappa distribution of footnote 8.

2.11 Tails, heavy-tailed and light-tailed distributions


There is a substantial difference between the distributions corresponding to equation
(2.61) on the one hand and (2.58) on the other hand. Specifically, the former are light-tailed
and the latter heavy-tailed. In heavy-tailed distributions for any t > 0 (however
small) the following limit diverges to infinity:

$$\lim_{x\to\infty} e^{tx}\,\bar{F}(x) = \infty \qquad (2.62)$$

In turn, a heavy-tailed distribution is characterized by the so-called tail index (or, in
case of ambiguity, higher-tail index), defined to be that number ξ for which the following
asymptotic equation holds true:

$$\lim_{x\to\infty} x^{1/\xi}\,\bar{F}(x) = l_1 \qquad (2.63)$$

where l₁ is nonzero and finite. The distributions listed in Table 2.5 under the title
Hyperbolic background measure are heavy tailed. Those of them in which a parameter ξ
appears have tail index ξ (e.g., Pareto, Pareto-Burr-Feller). The remaining (e.g., lognormal)
have tail index zero (except a specific case of the generalized log-gamma, shown in the
table footnotes, whose tail index is infinite). At the same time, the moments of heavy-tailed
distributions also diverge beyond a certain order, i.e., E[xᵖ] = ∞ for all p > 1/ξ,
where ξ is the tail index. The distributions with zero tail index, such as the lognormal
distribution, have all their moments finite. For that reason, they are often regarded as
light-tailed. However, the lognormal distribution clearly satisfies (2.62) and therefore
according to this definition is heavy tailed.
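
Criterion (2.62) can be explored numerically by working with logarithms, which avoids overflow for large x. The sketch below is an illustration under stated assumptions and not part of the source: it uses a unit-scale exponential tail, a Pareto tail written as (1 + ξx/λ)^(−1/ξ) with ξ = 0.2 (this particular parameterization is an assumption), and a standard lognormal tail; the function names are hypothetical.

```python
import math

def log_tail_exponential(x, lam=1.0):
    """ln of the exponential tail exp(-x/lam)."""
    return -x / lam

def log_tail_pareto(x, lam=1.0, xi=0.2):
    """ln of the Pareto tail (1 + xi*x/lam)^(-1/xi); assumed parameterization."""
    return -math.log1p(xi * x / lam) / xi

def log_tail_lognormal(x):
    """ln of the standard lognormal tail, erfc(ln(x)/sqrt(2))/2."""
    return math.log(0.5 * math.erfc(math.log(x) / math.sqrt(2.0)))

def log_criterion(log_tail, x, t):
    """ln of exp(t*x) * tail(x), the quantity whose limit appears in (2.62)."""
    return t * x + log_tail(x)

t = 0.01  # a small positive t
for x in (1e3, 1e4, 1e5, 1e6):
    print(f"x = {x:9.0f}: "
          f"exponential {log_criterion(log_tail_exponential, x, t):12.1f}  "
          f"Pareto {log_criterion(log_tail_pareto, x, t):10.1f}  "
          f"lognormal {log_criterion(log_tail_lognormal, x, t):10.1f}")
```

For the exponential tail the logarithm of the product keeps decreasing, so the product tends to zero, whereas for the Pareto and lognormal tails it eventually grows without bound. This matches the statement that the lognormal distribution satisfies (2.62) even though all its moments are finite.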

In a similar manner, we can define a lower-tail index whenever the domain of the
distribution is the entire line of real numbers (we must replace ∞ with −∞ and $x^{1/\xi}$ with
$|x|^{1/\xi}$). However, usually we deal with nonnegative quantities and, in this case, we need a
different manner to define the lower-tail index. Specifically, the lower-tail index is that
number ζ for which the following asymptotic equation holds true:

$$\lim_{x\to 0} x^{-\zeta} F(x) = l_2 \qquad (2.64)$$

where l₂ is again nonzero and finite*. Those distributions listed in Table 2.5 in which a
parameter ζ appears have lower-tail index ζ (e.g., Gamma, Weibull, Pareto-Burr-Feller).
Using l'Hôpital's rule, we see that $\lim_{x\to 0} x^{-\zeta} F(x) = \lim_{x\to 0} x^{1-\zeta} f(x)/\zeta$ and thus, if ζ < 1, the
density f(x) should necessarily be a decreasing function, at least close to the origin, with
$\lim_{x\to 0} f(x) = \infty$. In contrast, when ζ > 1, the density f(x) is increasing close to the origin,
with f(0) = 0, and is usually bell-shaped. The particular case ζ = 1 is characteristic of the
exponential and Pareto distributions, in which f(0) is finite and the density f(x) is a
decreasing function.


* It would be more natural to use 1/ζ instead of ζ in (2.64) so that it be more consistent with (2.63). However,
we used that convention in order for the parameterization of common distributions, such as Gamma and
Weibull, to be similar to the one dominating in the statistical literature.

Later we will discuss how both indices ξ and ζ can be visualized in a probability plot
(see Digression 5.A).


Digression 2.H: The hydrometeorological importance of heavy-tailed 
distributions 


In classical statistical mechanics the Lebesgue measure is used as background distribution. As a
consequence, a constrained mean results in the exponential distribution, which notably has coefficient
of variation σ/μ = 1. However, in several hydrometeorological processes, most notably rainfall,
when the time scale is small (e.g., daily or hourly), the empirical σ/μ is greater than one, which
means that the exponential distribution is not suitable. One may think that adding one more
constraint would fix the problem. The natural choice seems to be to constrain entropy
maximization by both the mean μ and the variance σ². However, this does not work, as for a
nonnegative stochastic variable entropy maximization with Lebesgue background measure
cannot yield σ/μ > 1. In other words, the exponential distribution is the upper limit.

The next solution to try is either to use a trickier (less natural) constraint, to change the 
definition of entropy (using a generalized definition) or to change the background measure. The 
first two cases have been studied in Koutsoyiannis (2005a) and Papalexiou and Koutsoyiannis 
(2012) and the last one in Koutsoyiannis (2017). Whatever the choice may be, the result is 
practically the same: a heavy-tailed distribution. The easiest way to derive that distribution is by 
the framework described above, using the hyperbolic background measure and a single 
constraint, the mean of the distance function. The resulting Pareto distribution has
σ/μ = 1/√(1 − 2ξ) > 1.

In other words, by changing the background measure from Lebesgue to hyperbolic, the light- 
tailed exponential distribution changes to the heavy-tailed Pareto one. The theoretical framework 
otherwise remains unaffected—the same probabilistic definition of entropy is used in both cases. 
But the change in the derived distribution may have important consequences in the design and 
management related to extreme events. To illustrate this based on real world data we use the 
daily rainfall data of Bologna, a data set already studied in section 1.3. 

During the 206 years of observations there were 19 426 rain days, all of which are used in the 
modelling. The nonzero rainfall depths of all 19 426 days are plotted against their empirical 
return periods in Figure 2.4. Following the initial discussion of the concept of return period in 
section 1.5, the return period of an observed value x is related to the probability of exceedance by 
T(x) = D/F̄(x), where D = 1 d. More accurate and detailed discussion of return period will be
provided in Chapter 5. 

The 19 426 values range between 0.1 and 155.7 mm, with a mean of 7.2 mm. In the exponential
distribution the single parameter λ equals the mean, which allows plotting the theoretical curve
corresponding to it in Figure 2.4. Clearly, the comparison with the empirical points of the figure
indicates a bad performance of the exponential model. In contrast, the Pareto model, also plotted
in Figure 2.4, looks suitable. It is remarkable that a model with only two parameters (the tail index
ξ and the scale parameter λ) can fit so well so many observations spanning 206 years.
The parameter values (ξ = 0.11 and λ = 7.78 mm) have been estimated by a least squares method
to minimize the error between the empirical and theoretical return period (the empirical return
period has been assigned by the method described in section 5.6). The good performance of the
Pareto distribution suggests that the hypothesis of a hyperbolic background measure, along with
the principle of maximum entropy, leads to a good predictive capacity.

Now, comparing the behaviour of the light-tailed exponential distribution with the heavy- 
tailed Pareto distribution, and both with the empirical distribution, we clearly see that the former 
underestimates severely the magnitude of the extremes. Notably, for a return period of 10 000 
years, which is typically used in the engineering design of major projects such as dams, Figure 2.4 
shows that the exponential distribution predicts a rainfall depth of ~100 mm, a value that was in 
fact exceeded seven times in the 206-year available record. On the other hand, the Pareto 
distribution predicts a value of ~250 mm, 2.5 times higher (and as we will see in Digression 6.E,
it becomes even higher if we also take into account the dependence structure of rainfall). Thus,
inappropriate model selection, based on inappropriate theoretical considerations, may have 
substantial consequences in practical application. Sooner or later, nature per se will reveal the 
inappropriateness (e.g. by frequent exceedances of the design values). In such cases, one could re- 
examine the theory (even though an alternative more popular practice is to blame external agents 
and find good scapegoats). 
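
The two design values quoted above can be checked with a few lines of Python. The sketch below inverts both fitted distributions at a return period of 10 000 years. It rests on assumptions that the text does not spell out: both distributions are taken to describe wet-day depths only, the Pareto exceedance probability is taken as $(1 + \xi x/\lambda)^{-1/\xi}$, the return period is taken to count all days (so the wet-day exceedance probability is multiplied by the wet-day frequency of the record), and a year is taken as 365.25 days. All function and variable names are illustrative.

```python
import math

# Values quoted in the text
N_WET_DAYS = 19426        # rain days in the record
N_YEARS = 206             # length of the record (years)
MEAN_DEPTH = 7.2          # mm, mean wet-day depth (exponential parameter)
XI, LAM = 0.11, 7.78      # fitted Pareto tail index and scale (mm)

DAYS_PER_YEAR = 365.25    # assumption
P_WET = N_WET_DAYS / (N_YEARS * DAYS_PER_YEAR)  # assumed wet-day frequency


def wet_day_exceedance(return_period_years):
    """Exceedance probability among wet days for a given return period (D = 1 d)."""
    return 1.0 / (return_period_years * DAYS_PER_YEAR * P_WET)


def exponential_quantile(p_exc, mean):
    """Solve exp(-x/mean) = p_exc for x."""
    return -mean * math.log(p_exc)


def pareto_quantile(p_exc, xi, lam):
    """Solve (1 + xi*x/lam)**(-1/xi) = p_exc for x."""
    return lam / xi * (p_exc ** (-xi) - 1.0)


p = wet_day_exceedance(10_000)
x_exp = exponential_quantile(p, MEAN_DEPTH)
x_par = pareto_quantile(p, XI, LAM)
print(f"exponential: {x_exp:.0f} mm, Pareto: {x_par:.0f} mm")
```

Under these assumptions the two values come out close to the ~100 mm and ~250 mm read from Figure 2.4.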

Figure 2.4 Rainfall depth vs. return period for Bologna based on 19 426 daily rainfall depths
observed throughout the observation period of 206 years. 


Indeed, in the 20th century, light-tailed distributions constituted the dominant theoretical
model in research and engineering practice. Given the substantial underestimation of
extremes by this model, its failure (and its severe consequences) should not be regarded as a
surprise. By now, both theoretical advances and accumulated empirical evidence have shaken this
model and have pointed to heavy-tailed distributions. More details will be provided in Digression
8.F.

In addition, Koutsoyiannis (2004a, 2005a, 2007) discussed several theoretical reasons that
favour heavy-tailed distributions over the exponential case, which are consistent with the above
discourse related to the hyperbolic background measure. Furthermore, the already discussed
(Chapter 1) omnipresence of change and the non-static climate are consistent with heavy-tailed
distributions, as will be further illustrated in Digression 3.H.


2.12 Two variables: joint distribution and joint moments 

The above sections have been devoted to concepts of probability pertaining to the analysis
of a single variable $\underline{x}$. Often, however, the simultaneous modelling of two (or more)
variables is necessary. Let the pair of stochastic variables $(\underline{x}, \underline{y})$ be defined on two basic
sets $(\Omega_x, \Omega_y)$, respectively. The intersection (simultaneous occurrence) of the two events
$\{\underline{x} \le x\}$ and $\{\underline{y} \le y\}$, denoted as $\{\underline{x} \le x, \underline{y} \le y\}$, is an event of the sample space
$\Omega_{xy} = \Omega_x \times \Omega_y$. Based on the latter event, we can define the joint probability distribution function
of $(\underline{x}, \underline{y})$ as a function of the real variables (x, y):

$$F_{xy}(x, y) := P\{\underline{x} \le x, \underline{y} \le y\} \tag{2.65}$$


The subscripts xy can be omitted if there is no risk of ambiguity. 
If $F_{xy}$ is differentiable, then the function:

$$f_{xy}(x, y) := \frac{\partial^2 F_{xy}(x, y)}{\partial x\, \partial y} \tag{2.66}$$


is the joint probability density function of the two variables. Obviously, the following 
equation holds: 


$$F_{xy}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{xy}(\xi, \omega)\, d\omega\, d\xi \tag{2.67}$$


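The functions:

$$F_x(x) := P\{\underline{x} \le x\} = \lim_{y \to \infty} F_{xy}(x, y), \qquad F_y(y) := P\{\underline{y} \le y\} = \lim_{x \to \infty} F_{xy}(x, y) \tag{2.68}$$

are called the marginal probability distribution functions of $\underline{x}$ and $\underline{y}$, respectively. Also, the
marginal probability density functions can be defined, from

$$f_x(x) = \int_{-\infty}^{\infty} f_{xy}(x, y)\, dy, \qquad f_y(y) = \int_{-\infty}^{\infty} f_{xy}(x, y)\, dx \tag{2.69}$$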


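Similar to the univariate case, we can define the expected value of any given function
$g(\underline{x}, \underline{y})$ of the stochastic variables $\underline{x}$ and $\underline{y}$ by

$$E[g(\underline{x}, \underline{y})] := \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f_{xy}(x, y)\, dx\, dy \tag{2.70}$$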

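In this manner, we define the (noncentral or about the origin) joint moment of orders p, q
as:

$$m_{pq} := E[\underline{x}^p \underline{y}^q] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^p y^q f_{xy}(x, y)\, dx\, dy \tag{2.71}$$

as well as the joint central moment of orders p, q:

$$\mu_{pq} := E[(\underline{x} - \mu_x)^p (\underline{y} - \mu_y)^q] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_x)^p (y - \mu_y)^q f_{xy}(x, y)\, dx\, dy \tag{2.72}$$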


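If p = 0 or q = 0, we get the marginal moments (e.g., means, $\mu_x = m_{10}$, $\mu_y = m_{01}$;
variances, $\mathrm{var}[\underline{x}] = E[(\underline{x} - \mu_x)^2] = \mu_{20} = \gamma_x = \sigma_x^2$, $\mathrm{var}[\underline{y}] = \mu_{02} = \gamma_y = \sigma_y^2$, etc.). The
lowest-order joint central moment is the covariance:

$$\mathrm{cov}[\underline{x}, \underline{y}] := E[(\underline{x} - \mu_x)(\underline{y} - \mu_y)] = \mu_{11} = \sigma_{xy} = E[\underline{x}\,\underline{y}] - E[\underline{x}]\,E[\underline{y}] \tag{2.73}$$

A dimensionless index derived from covariance is the correlation coefficient:

$$r_{xy} := \frac{\sigma_{xy}}{\sigma_x \sigma_y} \tag{2.74}$$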


It is common knowledge (and easy to prove*) that $-1 \le r_{xy} \le 1$, where the values -1 and
1 indicate fully anti-correlated (fully negatively correlated) and fully (positively) correlated
variables. The particular case where:

$$\sigma_{xy} = r_{xy} = 0 \iff E[\underline{x}\,\underline{y}] = E[\underline{x}]\,E[\underline{y}] \tag{2.75}$$

defines uncorrelated variables. Independent variables are necessarily uncorrelated, but
independence is a stricter concept whose definition is:

$$F_{xy}(x, y) = F_x(x)\, F_y(y), \qquad f_{xy}(x, y) = f_x(x)\, f_y(y) \tag{2.76}$$
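
That independence is stricter than zero correlation can be seen in a small discrete example. The Python sketch below uses a hypothetical joint probability mass function (not taken from the text) in which y is fully determined by x, yet the covariance vanishes. Since the variables are discrete, the factorization in (2.76) is checked on probabilities of single outcomes rather than on densities.

```python
from fractions import Fraction

# Hypothetical joint probability mass function: x is -1, 0 or 1 with equal
# probability and y = x**2, so y is fully determined by x.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expectation(g):
    """E[g(x, y)] under the joint mass function."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


def marginal(index, value):
    """Marginal probability that coordinate `index` (0 for x, 1 for y) equals `value`."""
    return sum(p for xy, p in joint.items() if xy[index] == value)


# Covariance as E[xy] - E[x]E[y], the last form in (2.73)
cov = (expectation(lambda x, y: x * y)
       - expectation(lambda x, y: x) * expectation(lambda x, y: y))
p_joint = joint[(0, 0)]                        # P{x = 0, y = 0}
p_product = marginal(0, 0) * marginal(1, 0)    # P{x = 0} P{y = 0}

print("covariance:", cov)
print("joint probability:", p_joint, "product of marginals:", p_product)
```

The covariance is exactly zero while the joint probability differs from the product of the marginals, so the two variables are uncorrelated but not independent.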


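The joint entropy is defined in an analogous manner with that in the univariate case
(section 2.9). For discrete stochastic variables the entropy is:

$$\Phi[\underline{x}, \underline{y}] := E[-\ln P(\underline{x}, \underline{y})] = -\sum_i \sum_j P_{ij} \ln P_{ij} \tag{2.77}$$

where $P_{ij} := P\{\underline{x} = x_i, \underline{y} = y_j\}$. For continuous stochastic variables it is:

$$\Phi[\underline{x}, \underline{y}] := E\left[-\ln \frac{f(\underline{x}, \underline{y})}{\beta(\underline{x}, \underline{y})}\right] = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \ln \frac{f(x, y)}{\beta(x, y)}\, f(x, y)\, dx\, dy \tag{2.78}$$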


2.13 Conditional densities and expectations 





Of particular interest are the so-called conditional probability distribution function and
conditional probability density function of $\underline{x}$ for a specified value of $\underline{y} = y$; these are given
by:

$$F_{x|y}(x|y) := \frac{\int_{-\infty}^{x} f_{xy}(\xi, y)\, d\xi}{f_y(y)}, \qquad f_{x|y}(x|y) := \frac{f_{xy}(x, y)}{f_y(y)} \tag{2.79}$$


respectively. Switching x and y we obtain the conditional functions of y. 





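* From the obvious $E[(\underline{x} + \underline{y})^2] = E[\underline{x}^2] + E[\underline{y}^2] + 2E[\underline{x}\,\underline{y}]$, observing that the two sides are nonnegative
quantities and assuming, without loss of generality, $E[\underline{x}] = E[\underline{y}] = 0$, so that $E[\underline{x}^2] = \sigma_x^2$, $E[\underline{y}^2] = \sigma_y^2$,
$E[\underline{x}\,\underline{y}] = \sigma_{xy}$, we get $\sigma_x^2 + \sigma_y^2 + 2\sigma_{xy} \ge 0$ or $\sigma_{xy}/(\sigma_x \sigma_y) \ge -(1/2)(\sigma_x/\sigma_y + \sigma_y/\sigma_x) = -(1/2)(a + 1/a)$,
where $a := \sigma_x/\sigma_y > 0$. As $\sigma_{xy}/(\sigma_x \sigma_y)$ does not change when the variables are rescaled, the bound holds
for any a > 0. The tightest bound is obtained at a = 1, where $(a + 1/a)$ takes its minimum value 2, so that
$\sigma_{xy}/(\sigma_x \sigma_y) \ge -1$.
Furthermore, $E[(\underline{x} - \underline{y})^2] = E[\underline{x}^2] + E[\underline{y}^2] - 2E[\underline{x}\,\underline{y}]$ and, likewise, $\sigma_{xy}/(\sigma_x \sigma_y) \le (1/2)(\sigma_x/\sigma_y + \sigma_y/\sigma_x)
= (1/2)(a + 1/a)$, which at a = 1 gives $\sigma_{xy}/(\sigma_x \sigma_y) \le 1$.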
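
The conditional expected value of a function $g(\underline{x})$ for a specified value of $\underline{y} = y$ is
defined by

$$E[g(\underline{x})|y] := E[g(\underline{x})|\underline{y} = y] = \int_{-\infty}^{\infty} g(x)\, f_{x|y}(x|y)\, dx \tag{2.80}$$

An important quantity of this type is the conditional expected value of $\underline{x}$:

$$E[\underline{x}|y] := E[\underline{x}|\underline{y} = y] = \int_{-\infty}^{\infty} x\, f_{x|y}(x|y)\, dx \tag{2.81}$$

Likewise, the conditional variance is

$$\mathrm{var}[\underline{x}|y] := E[(\underline{x} - E[\underline{x}|y])^2 | \underline{y} = y] = \int_{-\infty}^{\infty} (x - E[\underline{x}|y])^2 f_{x|y}(x|y)\, dx \tag{2.82}$$

and can be also written as

$$\mathrm{var}[\underline{x}|y] = E[\underline{x}^2|y] - (E[\underline{x}|y])^2 \tag{2.83}$$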

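It is obvious from the definition (2.80) that the conditional expectation $E[g(\underline{x})|y]$ is a
function of the real variable y, call it h(y), rather than a constant. If we do not specify in the
condition the value y of the stochastic variable $\underline{y}$, then the quantity $E[g(\underline{x})|\underline{y}] = h(\underline{y})$
becomes a function of the stochastic variable $\underline{y}$. Hence, it is a stochastic variable itself. Its
own expected value is:

$$E[E[g(\underline{x})|\underline{y}]] = \int_{-\infty}^{\infty} E[g(\underline{x})|y]\, f_y(y)\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)\, f_{xy}(x, y)\, dx\, dy \tag{2.84}$$

where we have utilized (2.80) and (2.79). As a result,

$$E[E[g(\underline{x})|\underline{y}]] = E[g(\underline{x})] \tag{2.85}$$

This can be readily generalized for a function of two stochastic variables, i.e.,

$$E[E[g(\underline{x}, \underline{y})|\underline{y}]] = E[g(\underline{x}, \underline{y})] \tag{2.86}$$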


Entropy, as formally defined for the univariate case in section 2.9 and for the bivariate
case in equations (2.77) and (2.78), is an expectation and thus we can also define
conditional forms of entropy which are quite useful. Thus, for a specified value of $\underline{y} = y$
and for a discrete stochastic variable the entropy is:

$$\Phi[\underline{x}|y] := E[-\ln P(\underline{x}|y)] = -\sum_i P_{i|j} \ln P_{i|j} \tag{2.87}$$

where $P_{i|j} := P\{\underline{x} = x_i | \underline{y} = y_j\}$, and for a continuous stochastic variable it is:

$$\Phi[\underline{x}|y] := E\left[-\ln \frac{f(\underline{x}|y)}{\beta(\underline{x})}\right] = -\int_{-\infty}^{\infty} \ln \frac{f(x|y)}{\beta(x)}\, f(x|y)\, dx \tag{2.88}$$


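These quantities depend on the specified conditioning value y of $\underline{y}$. However, we can
define a global conditional entropy, for an unspecified value of $\underline{y}$. For discrete stochastic
variables it is:

$$E[\Phi[\underline{x}|\underline{y}]] = E[E[-\ln P(\underline{x}|\underline{y})|\underline{y}]] = -\sum_i \sum_j P_{i|j} \ln P_{i|j}\, P_j \tag{2.89}$$

For continuous stochastic variables it is:

$$E[\Phi[\underline{x}|\underline{y}]] = \int_{-\infty}^{\infty} \Phi[\underline{x}|y]\, f(y)\, dy = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \ln \frac{f(x|y)}{\beta(x)}\, f(x, y)\, dx\, dy \tag{2.90}$$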


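A relationship analogous to (2.85) does not hold in this case. This is easy to verify as

$$E[\Phi[\underline{x}|\underline{y}]] = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \ln \frac{f(x|y)}{\beta(x)}\, f(x, y)\, dx\, dy \ne -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \ln \frac{f(x)}{\beta(x)}\, f(x, y)\, dx\, dy = \Phi[\underline{x}] \tag{2.91}$$

In fact, the true relationship between the (global) conditional entropy and the marginal
one is an inequality (e.g. Papoulis, 1991, p. 564):

$$E[\Phi[\underline{x}|\underline{y}]] \le \Phi[\underline{x}] \tag{2.92}$$

Another distinction we have to stress is that:

$$E[\Phi[\underline{x}|\underline{y}]] \ne \Phi[\underline{x}|y] \tag{2.93}$$

because the latter quantity is generally a function of y while the former is not. An
interesting exception is the case of a bivariate normal distribution in which $\Phi[\underline{x}|y]$ turns
out to be a constant rather than a function of y ($\Phi[\underline{x}|y] = E[\Phi[\underline{x}|\underline{y}]] \le \Phi[\underline{x}]$). Generally, we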
should stress that: 


• conditional expectations like $E[\underline{x}|y]$ are deterministic functions of the conditioning
value y;
• conditional expectations like $E[\underline{x}|\underline{y}]$ are stochastic variables, depending on $\underline{y}$.


These remarks have to be added to the notes of Digression 2.A about the importance of 
notation. 


Digression 2.I: Does information decrease entropy? 


It is intuitive to say that, if a stochastic variable x has some relationship with another stochastic 
variable y then, if we observe the value of y, our uncertainty on x would decrease. As entropy is a 
formal measure of uncertainty, this can be formally stated as follows: the conditional entropy of 
x given information on y is smaller than the unconditional entropy of x. However, this simple
truth is sometimes contradicted in scientific texts. The reasons for the contradiction are the
inattentive use of concepts and inattentive notation. We will illustrate them with the following
example.




In Digression 2.C and Digression 2.D we studied the probabilities of the dry and wet (rain) state
in an area. Continuing this study, we now introduce the stochastic variables $\underline{x}$ and $\underline{y}$ for
today's and yesterday's state, respectively, with $\{\underline{x} = 0\}$ and $\{\underline{x} = 1\}$ representing a dry and wet
state of today, respectively, and likewise for yesterday. We assume for illustration the conditional
probabilities:

$$P\{\underline{x} = 1 | \underline{y} = 1\} = 0.4, \qquad P\{\underline{x} = 1 | \underline{y} = 0\} = 0.15$$

from which it directly follows that

$$P\{\underline{x} = 0 | \underline{y} = 1\} = 0.6, \qquad P\{\underline{x} = 0 | \underline{y} = 0\} = 0.85$$

and after some simple calculations we also find the marginal probabilities to be

$$P\{\underline{x} = 0\} = 0.8, \qquad P\{\underline{x} = 1\} = 0.2$$

Hence the unconditional entropy is:

$$\Phi[\underline{x}] = E[-\ln P(\underline{x})] = -0.8 \ln 0.8 - 0.2 \ln 0.2 = 0.500$$

while the entropy conditional on yesterday being dry is:

$$\Phi[\underline{x} | \underline{y} = 0] = E[-\ln P(\underline{x} | \underline{y} = 0)] = -0.85 \ln 0.85 - 0.15 \ln 0.15 = 0.423$$

and that conditional on yesterday being wet is:

$$\Phi[\underline{x} | \underline{y} = 1] = E[-\ln P(\underline{x} | \underline{y} = 1)] = -0.6 \ln 0.6 - 0.4 \ln 0.4 = 0.673$$


Now it is true that the information that yesterday was a wet day increased the entropy from 
0.5 (without any information) to 0.673. This happened because the probabilities of the two states, 
which initially were 0.8 vs. 0.2, far from the equiprobability (0.5) in which the entropy is 
maximized, have now approached each other (0.6 vs. 0.4) and thus the entropy increased. 

But this happens for that particular value, y = 1. If we consider all values (in our case two), on
the average the (global) conditional entropy is

$$E[\Phi[\underline{x}|\underline{y}]] = 0.423 \times 0.8 + 0.673 \times 0.2 = 0.473 < 0.500$$


In other words, the reply to the question in the above title is: Yes, information decreases entropy, 
but we must be attentive about the correct use of the probabilistic concepts. 
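
The numbers of this digression can be reproduced with the short Python sketch below. It assumes, as the "simple calculations" of the text imply, that the marginal probabilities are the stationary ones (the same for today and yesterday). Variable names are illustrative.

```python
import math


def entropy(probabilities):
    """Entropy -sum(p ln p) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Conditional probability of a wet day today given yesterday's state (0 dry, 1 wet)
p_wet_given = {0: 0.15, 1: 0.40}

# Stationary marginal probability of a wet day (assumption): p = 0.40 p + 0.15 (1 - p)
p_wet = p_wet_given[0] / (1 - p_wet_given[1] + p_wet_given[0])
marginal = {0: 1 - p_wet, 1: p_wet}

h_marginal = entropy(marginal.values())
h_given = {y: entropy([1 - p, p]) for y, p in p_wet_given.items()}
# Global conditional entropy: average of the conditional entropies over yesterday's state
h_global = sum(marginal[y] * h_given[y] for y in marginal)

print(f"unconditional: {h_marginal:.3f}")
print(f"given dry: {h_given[0]:.3f}, given wet: {h_given[1]:.3f}")
print(f"global conditional: {h_global:.3f}")
```

The conditional entropy given a wet yesterday exceeds the unconditional one, while the average over both states does not, in line with (2.92).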


2.14 Many variables 


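All above theoretical analyses can be easily extended to more than two stochastic
variables. For instance, the distribution function of the n stochastic variables $\underline{x}_1, \ldots, \underline{x}_n$ is

$$F_{x_1 \cdots x_n}(x_1, \ldots, x_n) := P\{\underline{x}_1 \le x_1, \ldots, \underline{x}_n \le x_n\} \tag{2.94}$$

and is related to the n-dimensional probability density function by

$$F_{x_1 \cdots x_n}(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{x_1 \cdots x_n}(\xi_1, \ldots, \xi_n)\, d\xi_n \cdots d\xi_1 \tag{2.95}$$

The variables $\underline{x}_1, \ldots, \underline{x}_n$ are independent if:

$$F_{x_1 \cdots x_n}(x_1, \ldots, x_n) = F_{x_1}(x_1) \cdots F_{x_n}(x_n) \tag{2.96}$$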

A useful rule to mention is the so-called chain rule, which allows expressing joint
densities as products of conditional densities:

$$f(x_1, \ldots, x_n) = f(x_n | x_{n-1}, \ldots, x_1) \cdots f(x_2 | x_1)\, f(x_1) \tag{2.97}$$

where for notational brevity we have omitted the subscripts of functions f(), as these are
identical to the arguments of the functions. A direct consequence for the evaluation of
entropy is

$$\Phi[\underline{x}_1, \ldots, \underline{x}_n] = E[\Phi[\underline{x}_n | \underline{x}_{n-1}, \ldots, \underline{x}_1]] + \cdots + E[\Phi[\underline{x}_2 | \underline{x}_1]] + \Phi[\underline{x}_1] \tag{2.98}$$


The expected values and moments are defined in a similar manner as in the case of two 
variables, and all properties discussed in section 2.12 are likewise generalized for 
functions of many variables. 


2.15 Linear combinations of stochastic variables 


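A consequence of the definition of the expected value is the relationship

$$E[c_1 g_1(\underline{x}_1, \underline{x}_2) + c_2 g_2(\underline{x}_1, \underline{x}_2)] = c_1 E[g_1(\underline{x}_1, \underline{x}_2)] + c_2 E[g_2(\underline{x}_1, \underline{x}_2)] \tag{2.99}$$

where $c_1$ and $c_2$ are any constants, and $g_1$ and $g_2$ are any functions. Apparently, this
property can be extended to any number of functions $g_i$. Applying it for the weighted sum
of two variables we obtain

$$E[a_1 \underline{x}_1 + a_2 \underline{x}_2] = a_1 E[\underline{x}_1] + a_2 E[\underline{x}_2] \tag{2.100}$$

Likewise, we can calculate the variance of the weighted sum. After some algebraic
operations we get

$$\mathrm{var}[a_1 \underline{x}_1 + a_2 \underline{x}_2] = a_1^2\, \mathrm{var}[\underline{x}_1] + a_2^2\, \mathrm{var}[\underline{x}_2] + 2 a_1 a_2\, \mathrm{cov}[\underline{x}_1, \underline{x}_2] \tag{2.101}$$

It is much more difficult to calculate the probability distribution function of such
combinations. As an example, for the simplest case, the sum $\underline{z} = \underline{x}_1 + \underline{x}_2$ of two
independent variables $\underline{x}_1$ and $\underline{x}_2$ has density:

$$f_z(z) = \int_{-\infty}^{\infty} f_{x_1}(z - x_2)\, f_{x_2}(x_2)\, dx_2 \tag{2.102}$$

The right-hand side is known as the convolution integral of $f_{x_1}(x)$ and $f_{x_2}(x)$. For
nonnegative variables it takes the form:

$$f_z(z) = \int_{0}^{z} f_{x_1}(z - x_2)\, f_{x_2}(x_2)\, dx_2, \qquad z \ge 0 \tag{2.103}$$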
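
As a minimal numerical illustration of the convolution integral for nonnegative variables, the Python sketch below approximates (2.103) with a midpoint rule. The choice of two exponential variables with unit mean is an assumption made only for the example; their sum is known to have density $z e^{-z}$, which serves as a check. Function names are illustrative.

```python
import math


def exp_density(x):
    """Density of an exponential variable with mean 1 (assumed example)."""
    return math.exp(-x) if x >= 0 else 0.0


def convolve_nonnegative(f1, f2, z, steps=2000):
    """Midpoint-rule approximation of the convolution integral (2.103)."""
    if z <= 0:
        return 0.0
    h = z / steps
    return h * sum(f1(z - (i + 0.5) * h) * f2((i + 0.5) * h) for i in range(steps))


for z in (0.5, 1.0, 3.0):
    numeric = convolve_nonnegative(exp_density, exp_density, z)
    closed_form = z * math.exp(-z)   # known density of the sum of two unit exponentials
    print(f"z = {z}: numeric {numeric:.6f}, closed form {closed_form:.6f}")
```

For other densities the same function can be used unchanged, although a closed form for comparison will rarely be available, which is the difficulty the text refers to.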


2.16 Variance-based correlation and the climacogram 


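While covariance and its equivalent standardized form, i.e., correlation, have been the
most customary tools to characterize dependence, they are neither the only nor the most
effective ones. Assuming two stochastic variables $\underline{x}_1$ and $\underline{x}_2$ (possibly representing
different physical quantities) with means $\mu_i$ (i = 1, 2), standard deviations $\sigma_i$, covariance
$\sigma_{12}$ and correlation coefficient $r_{12}$, we may form a different type of correlation coefficient
and covariance by examining a weighted sum of the two variables. Namely, we examine
the average of the variables $\underline{x}_i$ after standardizing them with their standard deviations $\sigma_i$,
which is necessary if they represent different physical quantities, in order to make them
compatible for addition. From (2.101) we obtain that the variance of this average is:

$$\mathrm{var}\left[\frac{1}{2}\left(\frac{\underline{x}_1}{\sigma_1} + \frac{\underline{x}_2}{\sigma_2}\right)\right] = \frac{1}{4} E\left[\left(\frac{\underline{x}_1 - \mu_1}{\sigma_1} + \frac{\underline{x}_2 - \mu_2}{\sigma_2}\right)^2\right] = \frac{1}{2} + \frac{1}{2}\, \mathrm{cov}\left[\frac{\underline{x}_1}{\sigma_1}, \frac{\underline{x}_2}{\sigma_2}\right] \tag{2.104}$$

where we can recognize in the rightmost term the correlation coefficient $r_{12}$. Defining

$$\rho_{12} := \mathrm{var}\left[\frac{1}{2}\left(\frac{\underline{x}_1}{\sigma_1} + \frac{\underline{x}_2}{\sigma_2}\right)\right], \qquad \gamma_{12} := \sigma_1 \sigma_2\, \rho_{12} = \mathrm{var}\left[\frac{1}{2}\left(\sqrt{\frac{\sigma_2}{\sigma_1}}\, \underline{x}_1 + \sqrt{\frac{\sigma_1}{\sigma_2}}\, \underline{x}_2\right)\right] \tag{2.105}$$

we find from (2.104) that

$$\rho_{12} = \frac{1 + r_{12}}{2}, \qquad \gamma_{12} = \frac{\sigma_1 \sigma_2 + \sigma_{12}}{2} \tag{2.106}$$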




Obviously, the same information as in $r_{12}$ is contained in $\rho_{12}$, which lies in the interval
[0, 1] with the values 0, 1/2, 1 representing fully anti-correlated, uncorrelated and fully
correlated variables, respectively. Consequently, $\gamma_{12}$ lies in the interval $[0, \sigma_1 \sigma_2]$ with the
values 0, $\sigma_1 \sigma_2/2$, $\sigma_1 \sigma_2$ representing fully anti-correlated, uncorrelated and fully correlated
variables, respectively.

The power of the notion of $\rho_{12}$ and $\gamma_{12}$ is the fact that, unlike $r_{12}$, they can be readily
expanded to many variables to provide a macroscopic (or bulk) measure of correlation
among all of them. Considering a number $\kappa > 0$ of stochastic variables, in the customary
case where all have identical variances $\gamma_1 = \sigma^2$, we write:

$$\rho_\kappa := \mathrm{var}\left[\frac{\underline{x}_1 + \cdots + \underline{x}_\kappa}{\kappa \sigma}\right], \qquad \gamma_\kappa := \mathrm{var}\left[\frac{\underline{x}_1 + \cdots + \underline{x}_\kappa}{\kappa}\right] = \sigma^2 \rho_\kappa \tag{2.107}$$

Clearly, $\gamma_\kappa$ is the climacogram, already defined in Chapter 1, and $\rho_\kappa$ is a dimensionless
(standardized) climacogram. They range in the intervals $[0, \gamma_1]$ and [0, 1], respectively,
with the highest value representing full correlation ($\underline{x}_1 + \cdots + \underline{x}_\kappa = \kappa \underline{x}_1 + c$, where c is a
constant) and the lowest representing deterministic linear dependence (i.e., the condition
that $\underline{x}_1 + \cdots + \underline{x}_\kappa = c$). In case of independence, $\gamma_\kappa = \gamma_1/\kappa$ and $\rho_\kappa = 1/\kappa$.
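
The three reference cases just mentioned can be verified from a covariance matrix alone, since the variance of an average is the sum of all covariance entries divided by $\kappa^2$ (the many-variable analogue of (2.101) with equal weights). The Python sketch below does this for a hypothetical equicorrelated set of variables; the common variance, the number of variables and the function names are assumptions for illustration.

```python
def variance_of_average(cov):
    """Variance of the average of variables with covariance matrix `cov`."""
    kappa = len(cov)
    return sum(sum(row) for row in cov) / kappa ** 2


def equicorrelated(kappa, variance, r):
    """Hypothetical covariance matrix: common variance, common correlation r."""
    return [[variance if i == j else r * variance for j in range(kappa)]
            for i in range(kappa)]


gamma_1 = 4.0   # assumed common variance
kappa = 5       # assumed number of variables
cases = (("independent", 0.0),
         ("fully correlated", 1.0),
         ("sum is constant", -1.0 / (kappa - 1)))
for label, r in cases:
    gamma_k = variance_of_average(equicorrelated(kappa, gamma_1, r))
    print(f"{label}: gamma_k = {gamma_k:.3f}, rho_k = {gamma_k / gamma_1:.3f}")
```

The correlation $-1/(\kappa - 1)$ is the most negative common correlation that $\kappa$ variables can share, and it makes the variance of their sum vanish.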


2.17 Limiting distributions and the central limit theorem 


As we have seen in section 2.15, it is rather difficult to calculate the distribution function
of the sum of two stochastic variables from the distributions of the constituents. This
difficulty increases as the number of constituents increases. However, if this number
becomes quite large, paradoxically the problem becomes easier: this is the ease of
macroscopization. A central role in resolving this paradox is played by the central limit theorem*,
one of the most important in probability theory. It concerns the limiting distribution
function of a sum of stochastic variables (the constituents), which, under some conditions but
irrespective of the distribution functions of the constituents, is always the same, the
celebrated normal distribution. This is the most commonly used distribution in probability
theory as well as in all scientific disciplines and, as we have seen in section 2.10, it is also
derived from the principle of maximum entropy for Lebesgue background measure and
constrained variance.

Let $\underline{x}_i$ (i = 1, ..., n) be stochastic variables whose sum $\underline{z}_n := \underline{x}_1 + \cdots + \underline{x}_n$ has mean $\mu_z$
and variance $\sigma_z^2$. The central limit theorem states that, under some general conditions (see
below), as n tends to infinity, the distribution of $\underline{z}_n$ will tend to the normal distribution
(also known as Gauss or Gaussian distribution and denoted as $N(\mu_z, \sigma_z)$), i.e.,

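$$\lim_{n \to \infty} F_{z_n}(z) = \int_{-\infty}^{z} \frac{1}{\sigma_z \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\zeta - \mu_z}{\sigma_z}\right)^2\right) d\zeta \tag{2.108}$$

and in addition, if $\underline{x}_i$ are continuous variables, the density function of $\underline{z}_n$ also has a limit,

$$\lim_{n \to \infty} f_{z_n}(z) = \frac{1}{\sigma_z \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{z - \mu_z}{\sigma_z}\right)^2\right) \tag{2.109}$$

We observe in (2.108) and (2.109) that the limits of the functions $F_{z_n}(z)$ and $f_{z_n}(z)$ do
not depend on the distribution functions of $\underline{x}_i$, that is, the result is always the same. Thus,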
provided that the conditions for the applicability of the theorem hold, (a) we can know 
the macroscopic behaviour (the distribution function of the sum) without knowing details 
of the constituents, and (b) precisely the same distribution describes any variable that is 
a sum of a large number of components. Here lies the great importance of the normal 
distribution in all sciences (mathematical, physical, social, economic, etc., as well as 
stochastics per se). 

In practice, the convergence for $n \to \infty$ can be regarded as an approximation if n is
sufficiently large. But how large should n be so that the approximation is satisfactory?
Generally, the literature suggests that a value n = 30 is satisfactory. However, this varies
depending on the (joint) distribution of the constituents $\underline{x}_i$. Figure 2.5 gives a graphical
illustration of the convergence to the normal distribution of the sum of n independent
variables. Clearly, if the distribution of $\underline{x}_i$ is symmetric (left panel, with uniform
distribution of $\underline{x}_i$), the convergence is rapid (even for n = 3) but if it is asymmetric (right
panel, with exponential distribution of $\underline{x}_i$) a value higher than 32 (the highest n shown in
the plot) is needed for a satisfactory convergence. In case of dependent $\underline{x}_i$ with positive
correlation, the convergence is slower and a much larger n is needed.


* The term was most likely introduced by Pólya in 1920. A first version of the theorem was formulated and
proved by Laplace in 1810 while at about the same time Gauss studied the normal distribution in 
characterizing measurement or model errors. Earlier, in 1733 de Moivre had introduced the normal 
distribution as an approximation of the binomial distribution (Fischer, 2010). 

Figure 2.5 Convergence of the sum of independent identically distributed stochastic variables to
the normal distribution (thick line). The thin continuous lines represent the probability density of
the constituent variables $\underline{x}_i$, which have mean 0 and standard deviation 1. On the left panel the
density is uniform on the interval $(-\sqrt{3}, \sqrt{3})$ with $f(x) = 1/(2\sqrt{3})$ and on the right panel
exponential, $f(x) = e^{-x-1}$, $x \ge -1$ (the parameters are chosen so as to have mean 0 and standard
deviation 1). The dotted lines represent the densities of the sums $\underline{z}_n = (\underline{x}_1 + \cdots + \underline{x}_n)/\sqrt{n}$ for the
values of n indicated in the legend. (The division of the sum by $\sqrt{n}$ helps for a better presentation
of the curves, as all $\underline{z}_n$ have the same mean and variance, 0 and 1, respectively, and does not affect
the essentials of the central limit theorem.)


The conditions for the validity of the central limit theorem are general enough, so that
they are met in many practical situations. Some sets of conditions (e.g. Papoulis, 1990, p.
215) of particular interest are the following: (a) the variables $\underline{x}_i$ are independent
identically distributed with finite third moment; (b) the variables $\underline{x}_i$ are bounded from
above and below with variance greater than zero; (c) the variables $\underline{x}_i$ are independent
with finite third moment and the variance of $\underline{z}_n$ tends to infinity as n tends to infinity. The
theorem has been extended for variables $\underline{x}_i$ that are interdependent, but each one is
effectively dependent on a finite number of other variables. Gnedenko and Kolmogorov
(1949) proved an extended version of the theorem, according to which the sum of n
stochastic variables with heavy-tailed distributions with tail index $\xi > 1/2$, therefore having
infinite variance, will tend to the so-called Lévy alpha-stable distribution, as $n \to \infty$. If $\xi <
1/2$, the standard central limit theorem holds, i.e., the sum converges to the Gaussian
distribution, which is a special case of the Lévy alpha-stable distribution. In the field of
hydroclimatic processes, the standard theorem suffices because we can justifiably assume 
that those processes have finite variance: an infinite variance would presuppose infinite 
energy to materialize, which is absurd. 

Most hydroclimatic processes (particularly rainfall and streamflow) have skewed 
distributions at fine time scales, and therefore the normal distribution is not a suitable 
model at these scales. However, the normal distribution describes with satisfactory 
accuracy variables that refer to longer time scales such as annual. Thus, the annual rainfall 
depth in an area with wet climate is the sum of many (e.g., 50-100) rainfall events during 
the year; this, however, is not valid for rainfall in dry areas. Likewise, the annual runoff
volume passing through a river section can be regarded as the sum of 365 daily volumes. 
These are not independent, but the central limit theorem can be applicable again. 

Nonetheless, it should be stressed that the convergence to the normal distribution 
concerns the body of the distribution. For example, what is depicted in Figure 2.5 is about 
the body of the distribution. What happens with the tail behaviour, i.e., the extremes? 
Apparently, once the theoretical conditions of validity are satisfied, the theoretical result 
should hold true. However, this may not be of any help in practice as the convergence in 
the tail is much slower. Figure 2.6 (left) shows that the convergence in the tail is indeed 
slow for the exponential distribution, much slower than that of the body of the same 
distribution shown in Figure 2.5 (right). The coefficient of skewness for the sum of 32 $\underline{x}_i$
is rather small (0.35), indicating a rather satisfactory approximation by the normal
distribution. However, Figure 2.6 (left) shows that for $\overline{F} = 0.001$ the exceedance probability of the
sum of 32 $\underline{x}_i$ is larger by an order of magnitude than that of the normal distribution.
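
A rough numerical check of these statements is possible without simulation, because the sum of n shifted exponential variables is gamma distributed with integer shape n. The Python sketch below computes the skewness of the sum and compares, at the value that the standardized sum exceeds with probability 0.001, the exceedance probability of the normal distribution. The reading of the comparison (same value, different exceedance probabilities) and all names are assumptions of this sketch.

```python
import math


def normal_exceedance(z):
    """Exceedance probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def standardized_sum_exceedance(z, n):
    """Exceedance probability of z_n = (x_1 + ... + x_n)/sqrt(n), where the
    x_i are independent with density exp(-x - 1), x >= -1.

    The sum of the n shifted variables x_i + 1 is gamma distributed with
    integer shape n, whose exceedance probability at t equals the Poisson
    probability of at most n - 1 events with mean t.
    """
    t = n + z * math.sqrt(n)
    if t <= 0:
        return 1.0
    term = total = math.exp(-t)
    for k in range(1, n):
        term *= t / k
        total += term
    return total


n = 32
skewness = 2.0 / math.sqrt(n)   # the skewness of one exponential variable is 2
print(f"skewness of the sum: {skewness:.2f}")

# Bisection for the value exceeded by z_n with probability 0.001
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if standardized_sum_exceedance(mid, n) > 0.001:
        lo = mid
    else:
        hi = mid
z_star = 0.5 * (lo + hi)
ratio = 0.001 / normal_exceedance(z_star)
print(f"z = {z_star:.2f}: normal exceedance {normal_exceedance(z_star):.1e}, ratio {ratio:.1f}")
```

The printed ratio shows how many times the normal approximation underestimates the exceedance probability at that point of the tail, even though the skewness of the sum is already small.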

Figure 2.6 Convergence of the sum of independent identically distributed stochastic variables to
the normal distribution (thick line) with focus on the right tail. The thin continuous lines represent
the exceedance probability of the constituent variables $\underline{x}_i$, which have mean 0 and standard
deviation 1. On the left panel the distribution is exponential with density $f(x) = e^{-x-1}$, $x \ge -1$ (as
in the right panel of Figure 2.5) and on the right panel Pareto with tail index 1/4 and exceedance
probability $\overline{F}(x) = (4/3 + x\sqrt{2}/3)^{-4}$, $x \ge -1/\sqrt{2}$ (the parameters are chosen so that the mean is
0 and standard deviation 1). The dotted lines represent the exceedance probability of the sums
$\underline{z}_n = (\underline{x}_1 + \cdots + \underline{x}_n)/\sqrt{n}$ for the values of n indicated in the legend. Their curly shape in the right
panel is due to the numerical (Monte Carlo) method used to construct them as analytical
integration is impossible beyond n = 2.


For a heavy-tailed distribution there are differences of several orders of magnitude, as
shown for the Pareto distribution in Figure 2.6 (right). The tail index of this Pareto
distribution is 1/4, which means that the moments below the fourth order exist and
therefore the necessary conditions for the central limit theorem are satisfied. Despite that,
the approximation of the distribution tail is unsatisfactory. Actually, it can be easily
understood that, as the moments of order $\ge 4$ of $\underline{x}_i$ are infinite, the same will hold for the
sum of any finite number of $\underline{x}_i$, while the limiting normal distribution has all its moments
finite. This conflict, along with the fact that the behaviour of extremes is closely connected
to the high-order moments of a distribution (see Chapter 6), suggests that we must be very
attentive in the application of the theorem in hydroclimatic processes, particularly
because these processes seem to exhibit heavy tails and long-range dependence.


2.18 Limiting extreme value distributions 


By analogy with the central limit theorem referring to the sum or the average of many
variables, limiting distributions may also arise, as $n \to \infty$, for the maximum of these
variables, $\underline{y}_n := \max(\underline{x}_1, \ldots, \underline{x}_n)$, whose exact distribution function for independent and
identically distributed $\underline{x}_i$ is:

$$F_{y_n}(y) = (F_x(y))^n \tag{2.110}$$


The relevant theory was developed in the 20th century. Historically, Fréchet (1927) was
the first to identify one of the asymptotic distributions of maxima, which bears his name.
Fisher and Tippett (1928) showed that there are only three possible limiting distributions
for extremes, while von Mises (1936) identified sufficient conditions for convergence to
the three limiting laws. Gnedenko (1943) set the solid foundations of the asymptotic
theory of extremes providing the precise conditions for the weak convergence to the
limiting laws. It is worth noting in this respect the celebrated book by Gumbel (1958),
who was one of the pioneers promoting and applying the formal theory into engineering
practice. The theory is concisely presented in a review paper by Davison and Huser
(2015). Assuming that $\underline{x}_i$ are independent and identically distributed, there exist a real
number $\xi$ and sequences of real numbers $\lambda_n > 0$ and $\varepsilon_n$ such that the rescaled maximum
$\underline{y}'_n := \max(\underline{x}_1, \ldots, \underline{x}_n)/\lambda_n - \varepsilon_n$ has limiting distribution, as $n \to \infty$:

$$H(y) := F_{y'_\infty}(y) = \exp\left(-\left(1 + \xi\left(\frac{y}{\lambda} - \varepsilon\right)\right)^{-1/\xi}\right), \qquad \xi y \ge \xi \lambda \left(\varepsilon - \frac{1}{\xi}\right) \tag{2.111}$$

Here $\lambda > 0$ is a scale parameter, $\varepsilon$ is a dimensionless location parameter and $\xi$ is a shape
parameter, identical to the tail index.

The parameter $\xi$ has a unique value, which is precisely the same as the tail index of
the parent distribution, but the parameters $\lambda$ and $\varepsilon$ are not unique. They can be chosen as
convenient (different choices will lead to appropriate modification of the sequences $\lambda_n$
and $\varepsilon_n$). A natural choice is $\varepsilon = 0$, $\lambda = 1$. A more customary option is to choose a large n for
which convergence has been achieved at a satisfactory degree, for that n set $\lambda_n = 1$ and
$\varepsilon_n = 0$ (so that $\underline{y}'_n = \underline{y}_n = \max(\underline{x}_1, \ldots, \underline{x}_n)$, without any rescaling) and calculate $\lambda$ and $\varepsilon$
from equation (2.111). To this aim (and given that, for finite n, (2.111) is an
approximation and not an exact relationship) we choose two points $x_1$ and $x_2$ and equate
$F_x(x)^n$ with $H(x)$ at these points. For mathematical convenience we can choose the two
points so that $-x_1/\lambda + \varepsilon = 0$, $-x_2/\lambda + \varepsilon = -1$, or $x_1 = \lambda\varepsilon$, $x_2 = \lambda\varepsilon + \lambda$. Hence, $F_x(\lambda\varepsilon)^n =
e^{-1}$, $F_x(\lambda\varepsilon + \lambda)^n = \exp(-(1 + \xi)^{-1/\xi})$. Solving for $\lambda$ and $\varepsilon$ we find:

$$\lambda = F_x^{-1}\left(\exp\left(-\frac{(1 + \xi)^{-1/\xi}}{n}\right)\right) - F_x^{-1}\left(e^{-1/n}\right), \qquad \varepsilon = \frac{F_x^{-1}\left(e^{-1/n}\right)}{\lambda} \tag{2.112}$$

where for $\xi = 0$, $(1 + \xi)^{-1/\xi} = 1/e$. This is usually done unconsciously, for example when
we study annual maxima of daily values and fit H(y) of equation (2.111) on the annual
maxima directly, without even deriving it from $F_x(x)$.
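
To make the two-point matching concrete, the Python sketch below evaluates $\lambda$ and $\varepsilon$ for a hypothetical case: an exponential parent distribution with unit mean, $F_x(x) = 1 - e^{-x}$, with $\xi = 0$ and n = 1000. These choices, and the function names, are assumptions for illustration only. The sketch then compares the exact distribution of the maximum, $F_x(y)^n$, with the fitted Gumbel form.

```python
import math


def parent_quantile(p_exc):
    """Inverse of F(x) = 1 - exp(-x), written in terms of the exceedance 1 - F."""
    return -math.log(p_exc)


def gumbel_parameters(n):
    """Scale and location from the two-point matching for xi = 0,
    where (1 + xi)**(-1/xi) is replaced by its limit 1/e."""
    # 1 - exp(-a) is computed as -expm1(-a) to keep precision for small a
    x1 = parent_quantile(-math.expm1(-1.0 / n))              # F(x1)**n = exp(-1)
    x2 = parent_quantile(-math.expm1(-math.exp(-1.0) / n))   # F(x2)**n = exp(-1/e)
    lam = x2 - x1
    return lam, x1 / lam


n = 1000   # hypothetical number of variables per maximum
lam, eps = gumbel_parameters(n)


def exact_max_cdf(y):
    """Exact distribution function of the maximum of n unit exponentials."""
    return (1.0 - math.exp(-y)) ** n


def gumbel_cdf(y):
    """Gumbel distribution function with the matched parameters."""
    return math.exp(-math.exp(-y / lam + eps))


print(f"lambda = {lam:.4f}, lambda*epsilon = {lam * eps:.4f}, ln n = {math.log(n):.4f}")
for y in (5.0, 7.0, 9.0, 12.0):
    print(f"y = {y}: F^n = {exact_max_cdf(y):.5f}, H = {gumbel_cdf(y):.5f}")
```

For this parent distribution the scale comes out close to 1 and the product $\lambda\varepsilon$ close to ln n, so the fitted curve is essentially the Gumbel distribution shifted by ln n.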

Depending on the value of $\xi$, the limiting distribution in equation (2.111), known as the
generalized extreme value (GEV) distribution, is a compact expression including three
cases with different behaviours:

• For $\xi = 0$, GEV takes the following form, known as the Gumbel distribution or
extreme value type I (EV1) distribution:

$$H(y) = \exp(-\exp(-y/\lambda + \varepsilon)) \tag{2.113}$$

This is a light-tailed distribution without an upper or lower bound.
• For $\xi > 0$, the distribution is known as Fréchet or extreme value type II (EV2) and
has a lower bound at $\lambda\varepsilon - \lambda/\xi$. This is a heavy-tailed distribution with tail index $\xi$.
• In case that $\xi < 0$ the distribution is known as the reverse Weibull or the extreme
value type III (EV3) distribution. This has an upper bound for y at $\lambda\varepsilon - \lambda/\xi$.


The GEV has the property of being max-stable, meaning that maxima from this
distribution, after linear transformation, have the same distribution. More formally,
Fréchet's necessary condition for max-stability is this: For any $n \in \mathbb{N}$ and $y \in \mathbb{R}$, there exist
real numbers $a_n > 0$ and $b_n$ such that:

$$(H(a_n y + b_n))^n = H(y) \tag{2.114}$$


In fact, GEV is the only distribution satisfying this condition. 
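
Max-stability can be checked numerically. The Python sketch below implements the GEV distribution function in the parameterization of (2.111) and (2.113) and verifies (2.114) for constants $a_n = n^{\xi}$ and $b_n = \lambda (n^{\xi} - 1)(1/\xi - \varepsilon)$, with $a_n = 1$ and $b_n = \lambda \ln n$ for $\xi = 0$. These constants are not quoted from the text; they were obtained here by substituting into (2.111). The parameter values are hypothetical.

```python
import math


def gev_cdf(y, xi, lam, eps):
    """GEV distribution function, with the Gumbel form as the xi = 0 case."""
    if xi == 0.0:
        return math.exp(-math.exp(-y / lam + eps))
    base = 1.0 + xi * (y / lam - eps)
    if base <= 0.0:
        # outside the support: below the lower bound (xi > 0) or above the upper one
        return 0.0 if xi > 0 else 1.0
    return math.exp(-base ** (-1.0 / xi))


def max_stability_constants(n, xi, lam, eps):
    """Constants a_n, b_n for which H(a_n*y + b_n)**n = H(y) (derived, see lead-in)."""
    if xi == 0.0:
        return 1.0, lam * math.log(n)
    a_n = n ** xi
    b_n = lam * (a_n - 1.0) * (1.0 / xi - eps)
    return a_n, b_n


xi, lam, eps = 0.15, 2.0, 1.5   # hypothetical parameter values
n = 50
a_n, b_n = max_stability_constants(n, xi, lam, eps)
for y in (1.0, 4.0, 10.0):
    lhs = gev_cdf(a_n * y + b_n, xi, lam, eps) ** n
    print(f"y = {y}: H(a y + b)^n = {lhs:.6f}, H(y) = {gev_cdf(y, xi, lam, eps):.6f}")
```

The two columns agree to rounding error, which is what the max-stability condition requires.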

A specific parent distribution F (x) belongs to the so-called (max-)domain of attraction 
of one of the three limiting laws, in the sense that the distribution of rescaled maxima 
from this parent distribution is this particular limiting law. Formal mathematical 
conditions determining a parent distribution’s domain of attraction were formulated by 
von Mises (1936) and Gnedenko (1943). The practical result is that heavy-tailed
distributions with tail index $\xi > 0$ (e.g., Pareto, Pareto-Burr-Feller, Student and its
extension, generalized log-gamma and generalized beta prime) belong to the domain of
attraction of EV2. Light-tailed distributions (e.g. exponential, gamma, Weibull, normal and
their generalizations) as well as heavy-tailed distributions with tail index $\xi = 0$ (e.g.
lognormal) belong to the domain of attraction of EV1. In the domain of attraction of EV3
belong distributions bounded from above (e.g. uniform).

Because of its upper bound, EV3 is not an appropriate model for hydroclimatic 
extremes, for nature has no upper limits. The values of ξ which we expect to see in 
hydroclimatic processes are in the range (0, 1/2), so that the variance is finite, as already 
discussed in section 2.17. The exact value of the tail index is important to specify in 
engineering design. The major question in this regard is how the value of an extreme 
quantity y grows as the probability of exceedance 1 − H(y) decreases, tending to zero. To put 
it the reverse way, at which rate does y tend to infinity as the probability of exceedance 
tends to zero? The Gumbel distribution represents the mathematically proven lower limit 
to the rate of this growth. The alternative is the Fréchet law, which represents a higher 
rate of growth. The two options may lead to substantial differences in design quantities 
for high return periods. As already discussed, the Fréchet law, which has a positive tail 
index, is a more realistic option. 
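
To give a feeling for how much the two options can differ, the sketch below compares design values under a Gumbel law (ξ = 0) and a Fréchet law (ξ > 0) that share the same scale and location. It assumes the GEV form H(y) = exp(−(1 + ξ(y/λ − ψ))^(−1/ξ)); the parameter values λ = 10, ψ = 3, ξ = 0.15 and the return periods (in number of blocks) are hypothetical and serve only as an illustration.

```python
import math

def gev_quantile(h, xi, lam, psi):
    """Value y for which H(y) = h under the GEV; xi = 0 gives the Gumbel case."""
    w = -math.log(h)
    if xi == 0.0:
        return lam * (psi - math.log(w))
    return lam * (psi + (w ** (-xi) - 1.0) / xi)

lam, psi = 10.0, 3.0          # hypothetical scale and location parameters
for T in (10, 100, 1000, 10000):
    h = 1.0 - 1.0 / T          # non-exceedance probability for return period T
    y_ev1 = gev_quantile(h, 0.0, lam, psi)
    y_ev2 = gev_quantile(h, 0.15, lam, psi)
    print(T, round(y_ev1, 1), round(y_ev2, 1), round(y_ev2 / y_ev1, 2))
```

With these assumed values, the ratio of the Fréchet to the Gumbel design value keeps increasing with the return period, which is the practical meaning of the different growth rates.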

When we are interested in minima, we can follow the same procedure, observing 
that z_n := min(x_1, …, x_n) = −max(−x_1, …, −x_n). Consequently, 
P{z_n ≤ z} = 1 − P{max(−x_1, …, −x_n) < −z} and hence the limiting distribution is 


$$G(z) := F_{z_n}(z) = 1 - \exp\left(-\left(1 + \xi\left(-\frac{z}{\lambda} - \psi\right)\right)^{-1/\xi}\right), \qquad \xi z \le \lambda(1 - \xi\psi) \tag{2.115}$$


Again, we have three cases: (a) ξ = 0, corresponding to the Gumbel (EV1) distribution of 
minima, i.e., 


$$G(z) = 1 - \exp\left(-\exp\left(\frac{z}{\lambda} + \psi\right)\right) \tag{2.116}$$


(b) ξ > 0, corresponding to the reverse Fréchet distribution, which has upper bound λ/ξ − λψ 
and a heavy lower tail, and (c) ξ < 0, corresponding to the Weibull distribution, which 
has lower bound λ/ξ − λψ and a light upper tail. 

While most of the above mathematical developments have assumed independent 
stochastic variables, the results can be approximately valid even in the case of variables 
dependent in time. Specifically, Leadbetter (1983) demonstrated that, under mild 
conditions, maxima of dependent series follow the same form of distributional limit laws 
as those of independent series. However, the dependence changes the location and scale 
parameters (Davison and Huser, 2015) in such a manner as if H(y) were replaced by 
(H(y))^θ, where θ ∈ (0, 1] is the so-called extremal index. It can be seen that this 
replacement is equivalent to a change of the parameters λ and ψ, while ξ remains the 
same. Also, the rate of convergence to the limit becomes slower in the case of dependence. 
Phenomenologically, time dependence of a process causes clustering or grouping of 
extreme events. The unfortunate fact that dependence in time is quite often 
misinterpreted as nonstationarity may explain the lately growing body of publications detecting 
nonstationarity in extremes (cf. Koutsoyiannis and Montanari, 2015). 
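
The equivalence between raising H(y) to the power θ and changing the scale and location parameters can be verified directly. The derivation below is a sketch that assumes the GEV form H(y) = exp(−(1 + ξ(y/λ − ψ))^(−1/ξ)); the primed symbols λ′ and ψ′ are notation introduced here for the modified parameters.

```latex
\begin{aligned}
\left(H(y)\right)^{\theta}
  &= \exp\left(-\theta\left(1+\xi\left(\frac{y}{\lambda}-\psi\right)\right)^{-1/\xi}\right)
   = \exp\left(-\left(\theta^{-\xi}\left(1+\xi\left(\frac{y}{\lambda}-\psi\right)\right)\right)^{-1/\xi}\right) \\
  &= \exp\left(-\left(1+\xi\left(\frac{y}{\lambda'}-\psi'\right)\right)^{-1/\xi}\right),
  \qquad
  \lambda' = \lambda\,\theta^{\xi}, \quad
  \psi' = \frac{1-\theta^{-\xi}\,(1-\xi\psi)}{\xi}
\end{aligned}
```

The tail index ξ is unaffected. In the limit ξ → 0 (Gumbel case) these expressions reduce to λ′ = λ and ψ′ = ψ + ln θ.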

Here it should be stressed that, if compared to the central limit theorem, which is 
characterized by a fast convergence to the limit (except in the extremes, as seen in Figure 
2.6), the convergence to the max-stable distribution may be much slower in some cases. The 
rate of convergence to the limit of distributions belonging to the domain of attraction of 
EV2 is generally satisfactory. However, for those belonging to the domain of attraction of 
EV1, such as the normal and lognormal distribution, the rate is desperately slow. The 
meaning of a slow convergence in real-world applications, where n is finite and often 
small, is that the approximation of EV1 to the actual distribution of maxima is not 
satisfactory. Thus, it may be preferable to approximate the actual distribution of maxima 
of variables with distributions belonging to the domain of attraction of EV1 by the EV2 
distribution, as illustrated in Digression 2.J. 


Digression 2.J: How well do limiting distributions approximate the exact 
distributions? 


For independent identically distributed variables, the exact distribution of maxima is (F(x))^n 
(equation (2.110)). To approximate the exact distribution by the GEV we use equation (2.112). As 
an example, for the maxima from the standard normal distribution approximated by the EV1 we 
get: 


$$\lambda = F^{-1}\left(e^{-1/(en)}\right) - F^{-1}\left(e^{-1/n}\right), \qquad \psi = \frac{1}{\lambda}\,F^{-1}\left(e^{-1/n}\right)$$

As a second example, for the Pareto distribution, F(x) = 1 − (1 + x)^(−1/ξ), approximated by the 
EV2 we get: 


cas 
ease oat (1-e7n) -1 
a=(1-e nao) -(1-e"") ; [> 

We have applied this approximation for n = 10, 100 and 1000 for the normal and the Pareto 
distributions which belong to the domain of attraction of EV1 and EV2, respectively. The results 
are shown graphically in Figure 2.7, along with the case n = 1, i.e., the parent distribution per se 
for the sake of comparison. 

The results are very good for the Pareto distribution and very bad for the normal distribution. 
Even for n = 1000, the EV1 severely overestimates the actual probability of exceedance. One may 
think of using the EV3 instead of EV1 for the approximation of the normal distribution. However, 
this is not advisable because the EV3, despite giving a better approximation, entails an upper 
bound to extremes which distorts a fundamental characteristic of the modelled phenomenon. 

Figure 2.7 Approximation of the true distribution of the maximum of n independent identically distributed 
variables (continuous lines) by the limiting extreme value distribution (dashed lines). (left) The parent 
distribution is the standard normal, N(0,1), and the approximating distribution is the EV1. (right) The 
parent distribution is the Pareto, F(x) = 1 − (1 + x)^(−1/ξ), with ξ = 0.25 and the approximating distribution 
is the EV2 with the same ξ. 
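
The comparison for the normal parent can be explored with a few lines of code. The sketch below is not the computation behind Figure 2.7; it assumes that the EV1 parameters are obtained by matching the limiting law to the exact one at the two points y = λψ (where H = exp(−1)) and y = λ(ψ + 1) (where H = exp(−1/e)), and it uses only the Python standard library. Function names and the test value y = 5 are illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def ev1_parameters(n):
    """Scale and location of the EV1 law approximating the maximum of n N(0,1) variables.

    Assumption: exact and limiting distributions are matched at y = lam*psi
    and y = lam*(psi + 1).
    """
    y0 = STD_NORMAL.inv_cdf(math.exp(-1.0 / n))
    y1 = STD_NORMAL.inv_cdf(math.exp(-1.0 / (math.e * n)))
    lam = y1 - y0
    return lam, y0 / lam

def exceedance_exact(y, n):
    """Exact probability that the maximum of n variables exceeds y."""
    return 1.0 - STD_NORMAL.cdf(y) ** n

def exceedance_ev1(y, n):
    """EV1 approximation of the same exceedance probability."""
    lam, psi = ev1_parameters(n)
    return 1.0 - math.exp(-math.exp(-(y / lam - psi)))

for n in (10, 100, 1000):
    print(n, exceedance_exact(5.0, n), exceedance_ev1(5.0, n))
```

Under these assumptions the EV1 exceedance probability at y = 5 stays well above the exact one even for n = 1000, in line with the overestimation described above.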


Likewise, Figure 2.8 provides similar information for the lognormal distribution with mean 
and standard deviation of ln x equal to 0 and 1, respectively (denoted LN(0,1)). Like the normal, 
it belongs to the domain of attraction of EV1. Here the approximation is even worse than in the 
normal case, but now the EV1 underestimates the exact probability of exceedance. For that reason, 
we could use EV2 as a better approximation (without having the problem of artificially inducing 
an upper bound). As seen in the right panel of Figure 2.8, this latter approximation is quite 
satisfactory. 

Figure 2.8 Approximation of the true distribution of the maximum of n independent variables with 
lognormal distribution LN(0,1) (continuous lines) by the limiting extreme value distribution (dashed lines), 
which is (left) EV1 and (right) EV2 with ξ = 0.3/n°°’, which was found after a numerical investigation and 
fitting a power function of n by minimizing the fitting error. 


2.19 Relationship of parent and extreme value distribution 


Because of problems originating from the slow rate of convergence of the actual 
distribution to GEV (particularly EV1), it may be a good idea not to use the limiting 
distributions in practical applications but, instead, to model the tail of the parent 
distribution or even the entire parent distribution. Yet the theory of max-stable 
distributions retains its usefulness to infer the tail behaviour of the parent distribution. 

Specifically, the tail behaviour of the parent distribution is described by the conditional 
distribution function: 


$$F(x \mid x > u) = P\{\underline{x} \le x \mid \underline{x} > u\} = \frac{F(x) - F(u)}{1 - F(u)} \tag{2.117}$$


for a value of the threshold u that is sufficiently large. Now, assuming that the 
parameterization of H(x) with regard to λ and ψ has been made with reference to a specific 
large n, as described for the derivation of (2.112), we choose u so that the exceedance 
probability 1 − F(u) equals 1/n. (This is a very common choice, as will be discussed in 
Digression 2.K.) Thus, F(x|x > u) = n(F(x) − 1) + 1 or: 


$$1 - F(x \mid x > u) = n\,(1 - F(x)) \tag{2.118}$$
On the other hand, we can write for F(x) approaching 1, 


$$-\ln H(x) \approx -\ln\,(F(x))^n = -n \ln F(x) \approx n\,(1 - F(x)) \tag{2.119}$$


because ln F(x) = ln(1 − (1 − F(x))) = −(1 − F(x)) − (1 − F(x))²/2 − ⋯ and for F(x) 
approaching 1 we can keep the first term only. Hence, combining (2.118) and (2.119) we 
find: 


$$F(x \mid x > \lambda\psi) = 1 + \ln H(x) = 1 - \left(1 + \xi\left(\frac{x}{\lambda} - \psi\right)\right)^{-1/\xi}, \qquad x \ge \lambda\psi \tag{2.120}$$


where we equated u with λψ for consistency (i.e. to make F(u|x > u) = 0). This is the 
Pareto distribution for ξ > 0, while for ξ = 0 we get the exponential form: 


$$F(x \mid x > \lambda\psi) = 1 + \ln H(x) = 1 - \exp\left(-\frac{x}{\lambda} + \psi\right), \qquad x \ge \lambda\psi \tag{2.121}$$


Furthermore, for values of x large enough to make H(x) approach 1, we can use the 
same logic to get ln H(x) ≈ H(x) − 1 and hence 


$$F(x \mid x > \lambda\psi) \approx H(x) \tag{2.122}$$


The error of this approximation does not exceed ~1% for H(x) > 0.99 and ~5% for H(x) > 0.9. 
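
The quoted error bounds can be checked with a short calculation. The sketch below assumes that the error is measured on the exceedance probability, i.e. it compares −ln H (the exceedance probability implied by (2.120)) with 1 − H (the one implied by (2.122)); the function name is an illustrative choice.

```python
import math

def relative_error(h):
    """Relative difference between the exceedance probabilities -ln(H) and 1 - H."""
    from_conditional = -math.log(h)   # 1 - F(x | x > lam*psi), from (2.120)
    from_maxima = 1.0 - h             # 1 - H(x), from (2.122)
    return (from_conditional - from_maxima) / from_conditional

for h in (0.9, 0.99, 0.999):
    print(h, relative_error(h))
```

Under this reading, the relative error is about 5% at H = 0.9 and about 0.5% at H = 0.99, and it decreases further as H approaches 1.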

We can generalize the above analysis for different values of the threshold u. In this case 
the resulting functional form remains the same, with the same tail index, but the location 
and scale parameters differ, i.e. (Davison and Huser, 2015): 


$$F(x \mid x > u) = 1 - \left(1 + \xi\left(\frac{x}{\lambda_u} - \psi_u\right)\right)^{-1/\xi} \tag{2.123}$$


where 

$$\lambda_u = \lambda\left(1 + \xi\left(\frac{u}{\lambda} - \psi\right)\right), \qquad \psi_u = \frac{u}{\lambda_u} \tag{2.124}$$


It is readily confirmed that if we set u = λψ in (2.123) and (2.124) we recover (2.120). 
However, this equation is valid only for large values of u (unless the unconditional F(x) 
is Pareto, in which case it is valid for any u). 
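
The change of parameters in (2.124) can be verified numerically by conditioning the tail law (2.120) on a higher threshold. The sketch below assumes ξ > 0 and uses hypothetical parameter values; the function names are illustrative.

```python
def pareto_tail_exceedance(x, xi, lam, psi):
    """1 - F(x | x > lam*psi) according to (2.120), for xi > 0."""
    return (1.0 + xi * (x / lam - psi)) ** (-1.0 / xi)

def shifted_parameters(u, xi, lam, psi):
    """Scale and location for threshold u according to (2.124)."""
    lam_u = lam * (1.0 + xi * (u / lam - psi))
    return lam_u, u / lam_u

def conditional_exceedance(x, u, xi, lam, psi):
    """Exceedance probability of x given exceedance of u >= lam*psi, by direct conditioning."""
    return pareto_tail_exceedance(x, xi, lam, psi) / pareto_tail_exceedance(u, xi, lam, psi)

xi, lam, psi = 0.2, 2.0, 1.5          # hypothetical values; lam*psi = 3
u, x = 5.0, 8.0
lam_u, psi_u = shifted_parameters(u, xi, lam, psi)
print(conditional_exceedance(x, u, xi, lam, psi))
print(pareto_tail_exceedance(x, xi, lam_u, psi_u))
```

The two printed values should coincide up to rounding, and setting u = λψ returns the original λ and ψ.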

A final note that may be relevant in some analyses is this. If the value of n in 
y_n = max(x_1, …, x_n) is not constant but a stochastic variable with Poisson distribution with 
mean ν, while the x_i are independent, then the conditional distribution of y_n on specified n 
remains F_y(y|n) = (F(y))^n, but for unspecified n the unconditional distribution 
becomes (Todorovic and Zelenhasic, 1970): 


$$F_y(y) = \exp\left(-\nu\,(1 - F(y))\right) \tag{2.125}$$


This resembles (2.119) with the difference that it is now exact rather than approximate. 
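
The exactness of (2.125) follows from summing (F(y))^n over the Poisson probabilities of n, and it can be confirmed numerically. In the sketch below, f stands for the value F(y) of the parent distribution function at some y; the truncation of the series at 200 terms and the numerical values are illustrative assumptions.

```python
import math

def poisson_mixture_cdf(f, nu, n_terms=200):
    """Sum over n of P{n} * f**n, where n is Poisson distributed with mean nu."""
    total = 0.0
    weight = math.exp(-nu)            # Poisson probability of n = 0
    for n in range(n_terms):
        total += weight * f ** n
        weight *= nu / (n + 1)        # recursion: P{n + 1} = P{n} * nu / (n + 1)
    return total

def closed_form_cdf(f, nu):
    """Right-hand side of (2.125)."""
    return math.exp(-nu * (1.0 - f))

print(poisson_mixture_cdf(0.95, 12.0), closed_form_cdf(0.95, 12.0))
```

The term for n = 0 corresponds to a period with no events, for which the maximum is taken not to exceed any y.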
As already discussed above and in section 2.18, because of the problems of the limiting 
extreme value distributions, it is preferable to focus the studies of extremes on the parent 
distributions and primarily their tails. From the above theoretical discussion, we have 
reasons to expect a parent distribution tail of Pareto type, or at least exponential, but this 
should be verified each time based on observations. Nowadays there is an abundance of 
hydrometeorological data on daily and subdaily scales and there is no need to extract 
annual or seasonal maxima. Instead, we should use the entire observational record or at 
least the values over some threshold. Only if the available observations are originally 
given in terms of time-block (e.g., annual) maxima is it pertinent to refer to extreme value 
distributions. Advantages of studying the distribution of the parent variable rather than 
the distribution of maxima are discussed in Digression 2.K. 


Digression 2.K: Block maxima vs. values over threshold vs. complete record 


Traditionally, hydrometeorological records are analysed in either of two ways. The most frequent 
is to choose the highest of all recorded values for a given time period or “block” (typically year) 
and form a statistical sample (commonly referred to as “block maxima”) with size equal to the 
number of blocks (typically years) of the record. The other is to form a sample of values over a 
threshold (here abbreviated as VOT but sometimes referred to as “peaks-over-threshold”—POT) 
with all recorded values over a certain threshold irrespective of the time they occurred. Usually, 
the threshold is chosen high enough, so that the sample size is again equal to the number of years 
of the record. This however is not necessary: it can well be set equal to zero, so that all recorded 
values are included in the sample (the complete sample). However, a high threshold simplifies the 
study and helps focus the attention on the distribution tail. In addition, this choice simplifies the 
mathematical expression (compare equations (2.120) and (2.123)), leading to identity of 
distributional parameters of the distributions of block-maxima and values over threshold. 

Additionally, studying the complete series of observations has the advantage of respecting the 
motto “Save hydrological observations!” (Volpi et al., 2019). Indeed, extracting maxima over some 
period results in waste of information because other extreme observations should also be 
informative about extremes. Such information (e.g., the second-largest value of a period, which 
can be higher than another period’s largest value) is retained even if we use the values over 
threshold instead of the entire series of observations. 
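
The loss of information caused by block maxima can be seen in a toy example. The record below is hypothetical (three values per year, for brevity), and the threshold is chosen so that the sample of values over threshold has the same size as the sample of block maxima.

```python
def block_maxima(record):
    """Largest value of each block (here, each year)."""
    return [max(values) for values in record.values()]

def values_over_threshold(record, threshold):
    """All values exceeding the threshold, regardless of the block they belong to."""
    pooled = [v for values in record.values() for v in values]
    return sorted((v for v in pooled if v > threshold), reverse=True)

# Hypothetical rainfall depths (mm).
record = {
    2001: [12.0, 85.0, 90.0],
    2002: [30.0, 41.0, 22.0],
    2003: [60.0, 15.0, 77.0],
}

print(block_maxima(record))
print(values_over_threshold(record, threshold=60.0))
```

Here the second-largest value of 2001 (85 mm) exceeds the maximum of 2002 (41 mm): it is discarded by the block maxima approach but retained in the sample of values over threshold.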

Furthermore, the design quantities should naturally correspond to the parent distribution, 
rather than the artificially induced maxima over an arbitrarily defined time period. This favours 
the use of the parent distribution. As we have seen (equation (2.122)), the two are almost equal for 
very large design values, but for lower ones there are differences. Thus, even if our analyses are 
based on time-block extremes (H(y)), the results should eventually be converted to the parent 
distribution (F(x)) before they are used for design. The above discourse provides all the 
necessary mathematical support for such conversion. 

The most important reason favouring the study of the complete record over that of block 
maxima and values over threshold is that only the former provides faithful information about 
time dependence of the underlying process. As we have already seen in Chapter 1, such 
dependence may be marked and possibly of long-range type. As we will see in Chapter 6, 
neglecting dependence results in underestimation of extremes. On the other hand, the procedure 
of extracting block maxima leads to severe distortion of the dependence structure (Iliopoulou and 
Koutsoyiannis, 2019), whereas the concept of taking values over threshold relies on a tacit 
assumption of time independence, which may be inappropriate, particularly for the streamflow 
process (Lombardo et al., 2019). 


Chapter 3. Stochastic processes and quantification of change 


3.1 Definitions 


A deterministic world view is founded on a concept of sharp exactness. A deterministic 
mathematical description of a system uses regular variables (e.g. x) which are 
represented as numbers. The change of the system state is represented as a trajectory 
x(t), which is the sequence of a system’s states x as time t changes. Changes in time are 
studied using the concept of a dynamical system with certain system dynamics. The latter 
term denotes a transformation S_t which maps the initial state x(0) in the trajectory of a 
dynamical system (at time 0) to its current state x(t) (at time t), that is, x(t) = S_t(x(0)) 
(Lasota and Mackey 1994). 

In an indeterministic world view there is uncertainty or randomness, where the latter 
term simply means unpredictability or intrinsic uncertainty. In turn, to study the change 
according to this approach we use the notion of a stochastic process. This is defined to be 
an arbitrarily (usually infinitely) large family of stochastic variables x(t) (Papoulis, 1991). 
To each one of them there corresponds an index t, which takes values from an index set T, 
most often referring to time. The time t can be either discrete (when T is the set of integers, 
Z) or continuous (when T is the set of real numbers, R); thus, we have respectively a 
discrete-time or a continuous-time stochastic process. As natural time runs continuously, 
the faithful representation of a natural process needs a model formulated for continuous 
time to avoid the risk of making artificial constructs. Nonetheless, the discrete-time 
representation is certainly necessary in simulation. Typically, the discrete-time 
representation x_τ is derived from the continuous-time representation x(t) as the 
temporal average: 


$$x_\tau := \frac{1}{D}\int_{(\tau-1)D}^{\tau D} x(u)\,\mathrm{d}u \tag{3.1}$$


where τ ∈ ℤ represents the continuous-time interval [(τ − 1)D, τD] and D is the time step; 
notice that we use different notation in the continuous- and discrete-time representations, 
in the latter case denoting time as a subscript. Each of the stochastic variables x(t) or x_τ 
can be either discrete (e.g. the wet or dry state of a day) or continuous (e.g. the rainfall 
depth); thus, we have respectively a discrete-state or a continuous-state stochastic 
process. 
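
In practice the temporal average of equation (3.1) is applied to values that are themselves already discrete. The sketch below approximates the integral by the arithmetic mean of equally spaced fine-resolution values within each time step; the function name and the choice to drop an incomplete final step are assumptions made for illustration.

```python
def average_to_time_step(fine_values, points_per_step):
    """Average consecutive groups of fine-resolution values, one average per time step.

    A trailing group with fewer than points_per_step values is dropped.
    """
    n_steps = len(fine_values) // points_per_step
    return [
        sum(fine_values[i * points_per_step:(i + 1) * points_per_step]) / points_per_step
        for i in range(n_steps)
    ]

# Hypothetical values at a fine resolution, averaged in groups of three.
print(average_to_time_step([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 3))
```

For example, hourly values would be averaged with points_per_step = 24 to obtain a series at a daily time step D.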

The index set can also be a vector space, rather than the real line or the set of integers; 
this is the case for instance when we assign a stochastic variable (e.g. rainfall depth) to 
each geographical location (a two dimensional vector space) or to each location and time 
instance (a three-dimensional vector space). Stochastic processes with multidimensional 
index set are also known as stochastic (or random) fields. 

A realization x(t) of a stochastic process x̲(t), which is a regular (numerical) function 
of the time t, is known as a sample function. Typically, a realization can be known 
(simulated) at countable time instances, i.e. in discrete time (not in continuous time, even 
in a continuous-time process). Likewise, observation of a natural process is also made in 
discrete time. The sequences of simulated or observed values are called time series. 
Clearly then, a time series is a finite sequence of numbers, whereas a stochastic process is 
a family of stochastic variables, infinitely many for discrete-time processes and 
uncountably infinitely many for continuous-time processes. A large body of literature does 
not make this distinction and confuses stochastic processes with time series (see 
Digression 3.E). 


3.2 Distribution function and moments 


The distribution function of the stochastic variable x(t), i.e., 

$$F(x; t) := P\{x(t) \le x\} \tag{3.2}$$


is called first order distribution function of the process. Likewise, the second order 
distribution function is: 


$$F(x_1, x_2; t_1, t_2) := P\{x(t_1) \le x_1,\ x(t_2) \le x_2\} \tag{3.3}$$

and the nth order distribution function is: 

$$F(x_1, \ldots, x_n; t_1, \ldots, t_n) := P\{x(t_1) \le x_1, \ldots, x(t_n) \le x_n\} \tag{3.4}$$


A stochastic process is completely determined if we know the nth order distribution 
function for any n. The nth order probability density function of the process is derived by 
taking the derivatives of the distribution function with respect to all x_i. 

The moments are defined in the same manner as in sections 2.12 and 2.14. Of particular 
interest are the following: 


1. The process mean, i.e. the expected value of the variable x(t): 


$$\mu(t) := \mathrm{E}[x(t)] = \int_{-\infty}^{\infty} x\, f(x; t)\,\mathrm{d}x \tag{3.5}$$


2. The process variance, i.e. the variance of the variable x(t): 


$$\gamma_0(t) := \mathrm{var}[x(t)] = \int_{-\infty}^{\infty} (x - \mu(t))^2 f(x; t)\,\mathrm{d}x \tag{3.6}$$


3. The process autocovariance, i.e. the covariance of the stochastic variables 
x(t) and x(t + h): 


$$c(t; h) := \mathrm{cov}[x(t), x(t + h)] = \mathrm{E}\left[(x(t) - \mu(t))\,(x(t + h) - \mu(t + h))\right] \tag{3.7}$$


where c(t; 0) = γ_0(t). 
4. The process autocorrelation, i.e., the correlation coefficient of the stochastic 
variables x(t) and x(t + h): 


$$r(t; h) := \mathrm{corr}[x(t), x(t + h)] = \frac{c(t; h)}{\sqrt{\gamma_0(t)\,\gamma_0(t + h)}} \tag{3.8}$$


Additional characteristics will be given in section 3.5. 


3.3 Stationarity 


The term process has been introduced in the scientific vocabulary as synonymous with 
change, as evident in Kolmogorov’s (1931) pioneering paper, in which he introduced the 
term stochastic process. This paper starts by stating “A physical process [is] a change of a 
certain physical system”. 

It is very common in science to try to identify invariant properties within change 
(Koutsoyiannis 2011a). For example, in the absence of an external force, the position of a 
body in motion changes in time but the velocity is unchanged (Newton’s first law). If a 
constant force is present, then the velocity changes but the acceleration is constant 
(Newton’s second law). If the force changes, e.g. the gravitational force with changing 
distance in planetary motion, the acceleration is no longer constant, but other invariant 
properties emerge, e.g. the angular momentum (Newton’s law of gravitation; see also 
Koutsoyiannis 2011a). 

Evidently, the notion of a stochastic process was invented to describe the irregular 
changes in natural systems more complex than the above, whose future evolution is impossible 
to model deterministically or to predict in full detail and with precision. Here, 
the great scientific achievement is the invention of macroscopic descriptions instead of 
modelling the details. This is essentially done using stochastics. Here lies the essence and 
usefulness of the stationarity concept, which seeks invariant properties in complex 
systems (Koutsoyiannis, 2011a, 2014a; Koutsoyiannis and Montanari, 2015). Following 
Kolmogorov (1931, 1938) and Khintchine (1934), a process is stationary if its statistical 
properties are invariant to a shift of time origin, i.e. x(t) and x(t′) have the same 
(multivariate) distribution for any t and t′. Furthermore, following Kolmogorov (1947), a 
process is called wide-sense stationary if its mean is constant and its autocovariance 
depends only on time differences, i.e.: 


$$\mathrm{E}[x(t)] = \mu \ (= \text{constant}), \qquad \mathrm{cov}[x(t), x(t + h)] = c(h) \tag{3.9}$$


A strict-sense stationary process is also wide-sense stationary but the converse is not true. 

A process that is not stationary is called nonstationary. In a nonstationary process one 
or more statistical properties depend on time, that is, they are deterministic functions of 
time. A typical case of a nonstationary process is a cumulative process whose mean is 
proportional to time. For instance, let us assume that the rainfall intensity at a 
geographical location and time of the year is modelled as a stationary process x(t), with 
mean μ. Let us further denote by X(t) the rainfall depth collected in a large container (a 
cumulative raingauge) at time t and assume that at the time origin, t = 0, the container is 
empty, so that X(t) = ∫_0^t x(s)ds. It is easy then to understand that E[X(t)] = μt. This is a 
deterministic (linear) function of time t and thus X(t) is a nonstationary process. 

We should stress that stationarity and nonstationarity are properties of a process, not 
of a sample function or time series. There is some confusion in the literature about this, 
as a lot of studies assume that a time series is stationary or not, or can reveal whether the 
process is stationary or not. As a general rule, to characterize a process as nonstationary, it 
suffices to show that a specific statistical property is a deterministic function of time (as 
in the above example of the raingauge), but this cannot be straightforwardly inferred 
merely from a time series. A time series formed from observations of a natural process 
can be neither stationary nor nonstationary. 

Stochastic processes describing periodic phenomena, such as those affected by the 
annual cycle of Earth, are nonstationary. For instance, the daily temperature at a 
mid-latitude location could not be regarded as a stationary process. It could be modelled as a 
special kind of nonstationary process whose characteristics depend on time in a 
periodic manner (i.e. they are periodic functions of time). Such processes are called 
cyclostationary processes. 


3.4 Ergodicity 


Stationarity is also related to another important stochastic concept, ergodicity.* Its 
importance derives from the fact that ergodicity is a prerequisite to make inference from 
data, that is, induction (the Aristotelian ἐπαγωγή, epagoge). This is a type of inference 
weaker than deduction (the Aristotelian ἀπόδειξις, apodeixis), albeit very useful when 
deduction is not possible. 

In dynamical systems, by definition (e.g. Mackey, 1992, p. 48), ergodicity is the 
property of a system all of whose invariant sets under the dynamic transformation are trivial 
(have zero probability). In other words, in an ergodic transformation starting from any 
point, the trajectory of the system state will visit all other points, without being trapped 
in a certain subset. The ergodic theorem (Birkhoff, 1931; Khintchine, 1933; see also 
Mackey, 1992, p. 54) allows redefining ergodicity within the stochastics domain 
(Papoulis, 1991, p. 427; Koutsoyiannis 2010) in the following manner: A stochastic 
process x(t) is ergodic if the time average of any (integrable) function g(x(t)), as time 
tends to infinity, equals the true (ensemble) expectation, i.e.: 

$$\lim_{T \to \infty} \frac{1}{T}\int_0^T g(x(t))\,\mathrm{d}t = \mathrm{E}[g(x(t))] \tag{3.10}$$


for a continuous-time process or: 


$$\lim_{T \to \infty} \frac{1}{T}\sum_{\tau=1}^{T} g(x_\tau) = \mathrm{E}[g(x_\tau)] \tag{3.11}$$


for a discrete-time process. The right-hand side in the above equations represents the true 
average, also known as ensemble average, whereas the left-hand side represents the time 
average, for the limiting case of infinite time. The left-hand side of equations (3.10) and 
(3.11) is a stochastic variable (as a sum or integral of stochastic variables) and is not a 
function of the time t. Hence, the right-hand side should not be a function of the time t, i.e. 
the process should be stationary. Furthermore, the right-hand side is a number, not a 
stochastic variable. Equating a number with a stochastic variable implies that the 
stochastic variable has zero variance. This is precisely the condition that makes a process 
ergodic. And this allows the estimation (i.e. approximate calculation) of the true but 
unknown property E[g(x(t))] from its time average, that is, from the available data. 
Without ergodicity, inference from data would not be possible. 

* The concept of ergodicity was first conceived by Boltzmann (1884/85), who coined the terms ergode and 
isodic, both of which are etymologized from Greek words, but which ones exactly is uncertain. Most 
probably, ergodic comes from the Greek ἔργον (ergon = work) and ὁδός (hodos = pathway). According to 
another interpretation, the second noun is εἶδος (eidos = form, kind, nature), or the whole word is a 
transliteration of the Greek adjective ἐργώδης (ergodes = laborious, troublesome; see Mathieu 1988). 

A stochastic process for which it can be shown that the property (3.10) or (3.11) holds 
true for the particular case that g(x(t)) = x(t), whose expectation is the mean 
(E[x(t)] = μ), is called mean-ergodic. The property could be extended to multivariate 
functions g() and thus we can speak about covariance-ergodic processes. Further 
information, including conditions that should hold for ergodicity, can be found in Papoulis 
(1991). 
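
Mean-ergodicity can be illustrated by simulation. The sketch below uses a first-order autoregressive model as a convenient example of a stationary process that is mean-ergodic; this model, its parameter values and the function names are assumptions made here for illustration and are not taken from the text.

```python
import random

def ar1_realization(n, mean, phi, rng):
    """One realization of a stationary AR(1) process with unit-variance innovations."""
    sd = (1.0 - phi ** 2) ** -0.5              # stationary standard deviation
    x = mean + sd * rng.gauss(0.0, 1.0)        # initial value drawn from the stationary law
    series = []
    for _ in range(n):
        x = mean + phi * (x - mean) + rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def time_average(series):
    """Left-hand side of (3.11) for g(x) = x and finite T."""
    return sum(series) / len(series)

rng = random.Random(1)                         # fixed seed for reproducibility
for n in (100, 10_000, 200_000):
    print(n, time_average(ar1_realization(n, mean=5.0, phi=0.5, rng=rng)))
```

As the length of the realization grows, the time average should settle close to the assumed ensemble mean of 5, which is what allows the mean to be estimated from a single time series.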

Now, if the system that is modelled in a stochastic framework has deterministic 
dynamics (meaning that a system input will give a single system response, as happens for 
example in most hydrological models), then a theorem applies (Mackey 1992, theorem 
4.5 p. 52), according to which a dynamical system with dynamics S_t(x) has a stationary 
probability density if and only if it is ergodic. Therefore, a stationary system is also ergodic 
and vice versa, and a nonstationary system is also non-ergodic and vice versa. Here we 
note that even if a system has deterministic dynamics, again it is legitimate to use a 
stochastic description, replacing the study of the evolution of system states S_t(x) with the 
evolution of probability densities of states f(x; t); one reason to prefer the stochastic 
description over the pure deterministic description is that the former includes 
quantification of uncertainty, whereas the deterministic dynamics does not eliminate 
uncertainty (Koutsoyiannis 2010). Furthermore, we clarify that the deterministic 
description through the transformation S_t(x) is fully compatible with a stochastic 
description that is stationary and ergodic, according to the theorem stated above: while 
the system state is changing in time t according to the transformation S_t(x), its statistical 
properties (and the probability density f(x; t)) can be constant in time (i.e. f(x)). 

If the system dynamics is stochastic (a single input could result in multiple outputs), 
then ergodicity and stationarity do not necessarily coincide. However, recalling that a 
stochastic process is a model and not part of the real world, we can always conveniently 
devise a stochastic process that is ergodic, provided that we have excluded 
nonstationarity. In conclusion, from a practical point of view ergodicity can always be 
assumed when there is stationarity, while this assumption is fully justified by the theory 
if the system dynamics is deterministic. Conversely, if nonstationarity is assumed, then 
ergodicity cannot hold, which forbids inference from data. This contradicts the basic 
premise in geosciences, including hydrology and climatology, where data are the only 
reliable information in building models and making inference and prediction. 


Digression 3.A: Misuses of stationarity and ergodicity 


Despite having a central role in stochastics, the concepts of stationarity and ergodicity have been 
widely misunderstood and broadly misused (Montanari and Koutsoyiannis, 2014; Koutsoyiannis 
and Montanari, 2015). In an attempt to find trends everywhere, according to the popular motto 
“stationarity is dead” (Milly et al. 2008), trend analysis of hydroclimatic processes is more 
fashionable today than ever before (Iliopoulou and Koutsoyiannis, 2020). The notion of a trend, 
as a fundamental constituent of time series, is very old, but it is fundamentally problematic 
(Koutsoyiannis, 2020a), despite its popularity. 

Ironically, most of these studies use time series data to estimate statistical properties, as if the 
process were ergodic, while at the same time their cursory estimates falsify the ergodicity 
hypothesis. The correct tactic, whenever our study is based on data and even when we deal with 
provably nonstationary and nonergodic processes, is to convert the process to a stationary and 
ergodic one before trying to make any inference from the data. 

As an example, assuming that we deal with the cumulative rainfall process X(t), used as an 
example in section 3.3, we convert the process into a stationary one in discrete time by x_τ := 
X(τD) − X((τ − 1)D), where D is a time step, and perform the same transformation to the time 
series data. Then we can use the x_τ data to make inferences. 
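
Applied to data, this transformation amounts to differencing the cumulative record. The sketch below assumes that the cumulative depth has been read at the equally spaced times 0, D, 2D, …; the readings are hypothetical.

```python
def increments(cumulative):
    """Differences X(tau*D) - X((tau - 1)*D) of a cumulative record read every D."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative raingauge readings (mm) at times 0, D, 2D and 3D.
cumulative_depth = [0.0, 2.5, 2.5, 7.0]
print(increments(cumulative_depth))
```

The resulting series has one value fewer than the cumulative record, each value being the depth collected within one time step.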

As a second example related to trends, let us examine a statement such as: “By analysing the 
time series x_τ (where τ denotes discrete time), we concluded that it is nonstationary and we 
identified an increasing trend with slope b.” This is an incorrect statement and can be corrected 
in the following manner: “We analysed the time series x_τ based on the hypothesis that the 
stochastic process x_τ − bτ is stationary and ergodic, which enabled the estimation of the slope b.” 
The latter statement respects the fact that we always need stationarity and ergodicity to make 
inference from data. 


3.5 Second order characteristics of stochastic processes 


Along with the definition of a stochastic process (section 3.1), we have already provided 
that of the autocovariance function, an important characteristic of the second-order 
distribution function of a stochastic process. However, there are other second-order 
characteristics that are useful to study, as they have certain properties that help 
understand and simulate stochastic processes. 

Before defining them, starting from the process of interest x(t) we will better explain 
the concepts of the cumulative process X(t) and the discrete-time process x_τ, which have 
already been introduced. As graphically shown in Figure 3.1, the cumulative process is 
defined as: 

$$X(t) := \int_0^t x(u)\,\mathrm{d}u \tag{3.12}$$

where obviously X(0) = 0. If x(t) aims to represent a natural process, then X(t) should 
necessarily be nonstationary. However, the averaged process, X(t)/t, will be stationary, 
provided that x(t) is stationary. With the help of the cumulative process, the discrete-time 
representation of the process (equation (3.1)) can also be written as: 


$$x_\tau^{(D)} := \frac{1}{D}\int_{(\tau-1)D}^{\tau D} x(u)\,\mathrm{d}u = \frac{X(\tau D) - X((\tau-1)D)}{D} \tag{3.13}$$


The superscript (D) in x_τ^(D) denotes the time step of discretization; in cases where we use a 
single discretization step and there is no ambiguity we will omit it, writing x_τ. 
The variance of X(t) at time t, ie.: 


ri) = var[X(t)] (3.14) 


is known as the cumulative climacogram. The variance of the averaged process X(k)/k at a
time scale k, as a function of time scale k, is the continuous-time variant of the
climacogram, already discussed in sections 1.3 and 2.16:

$$\gamma(k) := \mathrm{var}\!\left[\frac{X(k)}{k}\right] = \frac{\Gamma(k)}{k^2} \qquad (3.15)$$
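To make the definition of the climacogram concrete, the following sketch computes an empirical counterpart from a time series: the series is averaged over non-overlapping blocks of κ values and the variance of the block averages is taken. The function name `empirical_climacogram` is a hypothetical helper, and the plain (1/n) variance used here ignores estimation bias; it is meant as an illustration of the concept, not as a recommended estimator.

```python
import numpy as np

def empirical_climacogram(x, scales):
    """Variance of the time-averaged series at each integer scale kappa."""
    x = np.asarray(x, dtype=float)
    gamma = {}
    for kappa in scales:
        n_blocks = len(x) // kappa
        if n_blocks < 2:
            continue  # too few blocks to speak of a variance
        blocks = x[: n_blocks * kappa].reshape(n_blocks, kappa)
        gamma[kappa] = blocks.mean(axis=1).var()  # plain (1/n) variance
    return gamma

# A perfectly alternating two-valued series: averaging at scale 2 removes all variability.
series = np.array([1.0, -1.0] * 8)
gamma = empirical_climacogram(series, scales=[1, 2, 4])
print(gamma)
```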


The autocovariance function c(h) of the continuous-time process x(t) for time lag h, 
already defined in equation (3.7), is related to the climacogram by (Koutsoyiannis 2016): 
$$c(h) := \mathrm{cov}[x(t), x(t+h)] = \frac{1}{2}\frac{d^2\Gamma(h)}{dh^2} \qquad (3.16)$$


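[Figure 3.1 panels: x(t) (instantaneous, continuous-time process); $X(t) := \int_0^t x(u)\,du$ (cumulative, nonstationary); $x_\tau := \frac{1}{D}\int_{(\tau-1)D}^{\tau D} x(u)\,du = \frac{1}{D}\bigl(X(\tau D) - X((\tau-1)D)\bigr)$ (averaged at time scale D); time axis t marked 0, D, 2D, ..., (τ−1)D, τD.]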


Figure 3.1 Explanatory sketch for a stochastic process in continuous time and its representation 
in discrete time. Note that the graphs display a realization of the process (it is impossible to 
display the process per se) while the notation is for the process per se. 


If we deal with two processes x(t) and y(t) we can define the cross-covariance: 


$$c_{xy}(h) := \mathrm{cov}[x(t), y(t+h)] \qquad (3.17)$$


This is a continuous-time metric. If we wish to involve also the time scale k of the averaged 
process, we can define the cross-climacogram (Koutsoyiannis, 2019b): 
AA Y(] + Dk) — Y(nk) 


Vey (k; n) = 0,0y Var ES ko, 


(3.18) 
where $Y(k) := \int_0^k y(t)\,dt$ and η is the lag.
The structure function (also known as semivariogram or variogram), v(h), is another 
second-order tool, defined as: 


$$v(h) := \frac{1}{2}\mathrm{var}[x(t) - x(t+h)] = c(0) - c(h) \qquad (3.19)$$


The power spectrum (also known as spectral density), s(w), where w denotes frequency,
is defined as the Fourier transform of the autocovariance function, i.e.:

$$s(w) := 4\int_0^\infty c(h)\cos(2\pi w h)\,dh \qquad (3.20)$$


The power spectrum should necessarily be nonnegative at all w ($s(w) \ge 0$), and this
entails that the autocovariance c(h) should be a positive definite function. Also, the 
climacogram y(k) should be a positive definite function (Koutsoyiannis, 2017). 

The power spectrum has some analogies with another stochastic tool, the so-called 
climacospectrum (Koutsoyiannis, 2017), which is directly given in terms of the 
climacogram. Specifically, it is proportional to the difference of the variances of the 
averaged process at time scales k and 2k: 


$$\psi(k) := \frac{k\,\bigl(\gamma(k) - \gamma(2k)\bigr)}{\ln 2} \qquad (3.21)$$


The climacospectrum is also written in an alternative manner in terms of frequency w =
1/k:

$$\tilde\psi(w) := \psi(1/w) = \frac{\gamma(1/w) - \gamma(2/w)}{w\,\ln 2} \qquad (3.22)$$

It is useful to note that the entire area under the power spectrum s(w), as well as that
under the curve $\tilde\psi(w)$, are precisely equal to each other and to the variance $\gamma_0$.
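The equality of the area under the frequency-domain climacospectrum to the variance can be verified in a few lines. The sketch below assumes that the climacogram is continuous with finite $\gamma(0) = \gamma_0$ and $\gamma(\infty) = 0$, and uses the substitution k = 1/w followed by Frullani's integral; these assumptions are ours and are stated only to make the step explicit.

```latex
\begin{align*}
\int_0^\infty \tilde\psi(w)\,dw
  &= \frac{1}{\ln 2}\int_0^\infty \frac{\gamma(1/w)-\gamma(2/w)}{w}\,dw
   = \frac{1}{\ln 2}\int_0^\infty \frac{\gamma(k)-\gamma(2k)}{k}\,dk \\
  &= \frac{\bigl(\gamma(0)-\gamma(\infty)\bigr)\ln 2}{\ln 2} = \gamma_0 .
\end{align*}
```

For the power spectrum, the same value follows from the inverse transform $c(h) = \int_0^\infty s(w)\cos(2\pi w h)\,dw$ evaluated at h = 0, which gives $\int_0^\infty s(w)\,dw = c(0) = \gamma_0$.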

All definitions of second-order characteristics in continuous time are gathered 
together in Table 3.1. Once any one of these characteristics is known in the continuous- 
time representation, we can calculate all others in continuous time as well as those in 
discrete time, as shown in Table 3.2. The reverse is not true, i.e., from a model formulated 
in discrete time we cannot infer precisely the characteristics of the continuous-time 
representation. It may be seen in Table 3.2 that the expressions of the discrete time 
characteristics may differ substantially from those in continuous time, and thus attention 
is needed to avoid confusion and misuse. The climacogram and the climacospectrum are 
exceptions, as they are not affected by discretization (they admit the same expressions for 
both continuous and discrete time), and have some additional advantages, such as 
simplicity, close relationship to entropy (see below), and more stable behaviour 
(Dimitriadis and Koutsoyiannis, 2015a; Koutsoyiannis, 2016; 2017) which make them the 
preferable tool in stochastic modelling—even though they are less popular than other 
tools. 


All these tools are transformations of one another, as listed in Table 3.3. 
Table 3.1 Summary of notation and second-order characteristics of a stationary stochastic process in continuous time.

| Name | Symbol and definition | Remarks | Eqn. no. |
|---|---|---|---|
| Stochastic process of interest | $x(t)$ | Assumed stationary | |
| Time, continuous | $t$ | Dimensional | |
| Cumulative process | $X(t) := \int_0^t x(\xi)\,d\xi$ | Nonstationary | (3.12) |
| Variance, instantaneous | $\gamma_0 := \mathrm{var}[x(t)]$ | Constant (not a function of t) | (3.6) |
| Cumulative climacogram | $\Gamma(t) := \mathrm{var}[X(t)]$ | A function of t; $\Gamma(0) = 0$ | (3.14) |
| Climacogram | $\gamma(k) := \mathrm{var}[X(k)/k] = \Gamma(k)/k^2$ | A function of time scale k; $\gamma(0) = \gamma_0$ | (3.15) |
| Time scale, continuous | $k$ | Units of time | |
| Climacospectrum | $\psi(k) := k\,(\gamma(k) - \gamma(2k))/\ln 2$ | | (3.21) |
| Autocovariance function | $c(h) := \mathrm{cov}[x(t), x(t+h)]$ | $c(0) = \gamma_0$ | (3.16) |
| Time lag, continuous | $h$ | Units of time | |
| Structure function (semivariogram, variogram) | $v(h) := \frac{1}{2}\mathrm{var}[x(t) - x(t+h)]$ | $v(h) = \gamma_0 - c(h)$ | (3.19) |
| Power spectrum (spectral density) | $s(w) := 4\int_0^\infty c(h)\cos(2\pi w h)\,dh$ | $\int_0^\infty s(w)\,dw = \gamma_0$ | (3.20) |
| Frequency, continuous | $w = 1/k$ | Units of inverse time | |


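Table 3.2 Summary of notation and second-order characteristics of a stationary stochastic process in discrete time.

| Name | Symbol and definition | Remarks | Eqn. no. |
|---|---|---|---|
| Stochastic process, discrete time | $x_\tau := \frac{1}{D}\int_{(\tau-1)D}^{\tau D} x(u)\,du = \frac{X(\tau D) - X((\tau-1)D)}{D}$ | | (3.13) |
| Discretization time step | $D$ | Length of time window of averaging | |
| Time, discrete | $\tau := t/D$ | Dimensionless | |
| Climacogram | $\gamma_\kappa := \gamma(\kappa D) = \Gamma(\kappa D)/(\kappa D)^2$ | $\gamma_1 = \mathrm{var}[x_\tau] = \gamma(D)$ | (3.23) |
| Climacospectrum | $\psi_\kappa := \psi(\kappa D)/D = \kappa\,(\gamma_\kappa - \gamma_{2\kappa})/\ln 2$ | | |
| Time scale, discrete | $\kappa := k/D$ | Dimensionless | (3.24) |
| Autocovariance function | $c_\eta := \mathrm{cov}[x_\tau, x_{\tau+\eta}]$ | $c_0 = \gamma(D)$ | |
| Time lag, discrete | $\eta := h/D$ | Dimensionless | (3.25) |
| Structure function | $v_\eta := \gamma_1 - c_\eta$ | | (3.26) |
| Power spectrum | $s_d(\omega) := \frac{1}{D}\sum_{j=-\infty}^{\infty} s\!\left(\frac{\omega + j}{D}\right)\mathrm{sinc}^2(\pi(\omega + j))$ | | (3.27) |
| Frequency, discrete | $\omega := wD = 1/\kappa$ | Dimensionless | (3.28) |

Note: In time-related quantities, Latin letters denote dimensional quantities and Greek letters dimensionless ones, as specified above.
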
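Table 3.3 Relationships between second-order characteristics of a stochastic process.

| Related characteristics | Direct relationship | Inverse relationship | Eqn. no. |
|---|---|---|---|
| $\gamma(k) \leftrightarrow c(h)$ | $\gamma(k) = 2\int_0^1 (1-\chi)\,c(\chi k)\,d\chi$ | $c(h) = \frac{1}{2}\frac{d^2(h^2\gamma(h))}{dh^2}$ | (3.29) |
| $s(w) \leftrightarrow c(h)$ | $s(w) = 4\int_0^\infty c(h)\cos(2\pi w h)\,dh$ | $c(h) = \int_0^\infty s(w)\cos(2\pi w h)\,dw$ | (3.30) |
| $\gamma(k) \leftrightarrow s(w)$ | $\gamma(k) = \int_0^\infty s(w)\,\mathrm{sinc}^2(\pi w k)\,dw$ | $s(w) = 2\int_0^\infty \frac{d^2(h^2\gamma(h))}{dh^2}\cos(2\pi w h)\,dh$ | (3.31) |
| $v(h) \leftrightarrow c(h)$ | $v(h) = \gamma_0 - c(h)$ | $c(h) = v(\infty) - v(h)$, with $v(\infty) = \gamma_0$ | (3.32) |
| $\psi(k) \leftrightarrow \gamma(k)$ | $\psi(k) = \frac{k\,(\gamma(k) - \gamma(2k))}{\ln 2}$ | $\gamma(k) = \ln 2 \sum_{i=0}^{\infty} \frac{\psi(2^i k)}{2^i k}$ | (3.33) |
| $\gamma_\kappa \leftrightarrow c_\eta$ | $\gamma_\kappa = \frac{1}{\kappa}\left(c_0 + 2\sum_{\eta=1}^{\kappa-1}\left(1 - \frac{\eta}{\kappa}\right)c_\eta\right) = \frac{\Gamma(\kappa D)}{(\kappa D)^2}$, where $\Gamma(0) = 0$, $\Gamma(D) = c_0 D^2$ and, recursively, $\Gamma(\kappa D) = 2\Gamma((\kappa-1)D) - \Gamma((\kappa-2)D) + 2c_{\kappa-1}D^2$ | $c_\eta = \frac{1}{D^2}\left(\frac{\Gamma((\eta+1)D) + \Gamma((\eta-1)D)}{2} - \Gamma(\eta D)\right)$ | (3.34) |
| $c_\eta \leftrightarrow s_d(\omega)$ | $s_d(\omega) = 2c_0 + 4\sum_{\eta=1}^{\infty} c_\eta\cos(2\pi\eta\omega)$ | $c_\eta = \int_0^{1/2} s_d(\omega)\cos(2\pi\omega\eta)\,d\omega$ | (3.35) |
| $v_\eta \leftrightarrow c_\eta$ | $v_\eta = \gamma(D) - c_\eta$ | $c_\eta = \gamma(D) - v_\eta$ | (3.36) |
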




Digression 3.B: What is dependence in time? 


Dependence of a stochastic process in time (also known as intertemporal dependence or simply 
time dependence) is typically expressed by the autocovariance or the autocorrelation function. In 
turn, its typical interpretation is memory. This has been so common that in many texts the term
memory has replaced the term dependence—even in the titles of several publications, papers and 
books. Perhaps the scientist who was most influential in establishing this interpretation was 
Mandelbrot (for example, Mandelbrot and Wallis, 1968, speak about short and long memory, both 
of which they contrast to independence), even though other scientists had used the term before 
(e.g. Krumbein, 1968). Clearly, in stochastics the term memory is metaphorical, while in other 
disciplines (neuropsychology, computer science) it is literal. In science there is no reason to use a 
metaphorical term when we have a literal term, particularly when the metaphorical term has 
another scientific meaning. 

Perhaps the metaphorical term memory distracts, rather than helps, intuition and 
understanding of time dependence in a stochastic process. In particular, its variant long memory 
is totally inappropriate as it stimulates people to imagine a mechanism inducing long memory 
(e.g. hundreds of years) and of course it is difficult to conceptualize such a mechanism. A better 
interpretation is a mechanism producing change, rather than recalling information (as is the
meaning of memory). And indeed, changes produce dependence—not the other way round. 
Furthermore, dependence and change need not be interpreted as nonstationarity as many think. 
But before discussing how change produces time dependence in a process that is stationary,
we will discuss how dependence manifests itself in a time series. In one word, this manifestation
is through patterns. In pure randomness, without time dependence (like in a sequence of dice
outcomes or in the sequence of digits of π) no patterns appear. To better illustrate such patterns,
we examine several time series with a small length, n = 16. For convenience we make these time 
series two-valued, with values -1 and 1 and with average of the 16 values equal to zero, which 
means that eight values will be -1 and eight 1. The estimates of the variance, the lag-one 
autocovariance and the lag-one autocorrelation coefficient will thus be, respectively: 


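$$\hat\gamma_0 = \frac{1}{16}\sum_{\tau=1}^{16} x_\tau^2 = 1, \qquad \hat c_1 = \frac{1}{16}\sum_{\tau=1}^{16} x_\tau x_{\tau+1}, \qquad \hat r_1 = \frac{\hat c_1}{\hat\gamma_0} = \hat c_1$$


where we set $x_{17} = x_1$ in order to have 16 terms in the sum for $\hat c_1$ and thus make possible values
up to +1 (even though this is not typically done in analyses of time series). The formal meaning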
of the term estimate is clarified in section 4.3. 
Figure 3.2 Examples of arrangements of eight ones and eight minus ones in the form of time series with
length 16, mean zero and unit variance, along with the resulting estimates of the lag-one autocorrelation
coefficient r. In addition to the original time series (scale 1; continuous line), time-averaged time series are
also shown at scales 2 (dashed lines) and 4 (dotted lines). In the bottom right panel, the frequency
distribution of r for all 16!/(8!)² = 12 870 possible cases (permutations) is shown.


Some instances of such time series are shown in Figure 3.2. In the upper left panel, all eight
ones are grouped together so that $\sum_{\tau=1}^{16} x_\tau x_{\tau+1} = 7 + 7 - 2 = 12$ and $\hat r_1 = 0.75$. This is the highest
possible value that a particular arrangement of 16 items, each being ±1, can give. Obviously, there
are 16 possible arrangements that will give $\hat r_1 = 0.75$. If our time series had length N, the
highest $\hat r_1$ would be (N − 4)/N = 1 − 4/N and would approach the value +1 for large N.
Consequently, a large autocorrelation is caused by grouping together of similar (in our example
same) values, and this grouping has been termed persistence. If the grouping appears but is not
that “perfect”, such as in the lower left panel, then again, the autocorrelation will be positive but
lower ($\hat r_1 = 0.5$ in this example).
In contrast, if the patterns appear to be of alternating, rather than grouping, type, then the
autocorrelation coefficient is negative. Thus, in the “perfect” alternating shape of the upper middle
panel of Figure 3.2 we have $\sum_{\tau=1}^{16} x_\tau x_{\tau+1} = -16$ and $\hat r_1 = -1$. In the lower middle panel
alternation is not perfect and $\hat r_1 = -0.75$. Finally, the upper right panel is free of patterns and $\hat r_1 = 0$.
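The numbers quoted for the grouped and alternating arrangements can be checked with a short script. The sketch below uses the wrap-around convention (the 17th value set equal to the first) and also enumerates all arrangements of eight ones and eight minus ones; the function name `lag_one_autocorrelation` is a hypothetical helper.

```python
from itertools import combinations
import numpy as np

def lag_one_autocorrelation(x):
    """Estimate r_1 with the wrap-around convention x_{n+1} = x_1 (zero-mean series)."""
    x = np.asarray(x, dtype=float)
    variance = np.mean(x ** 2)            # the mean is zero by construction
    c1 = np.mean(x * np.roll(x, -1))      # n terms; the last pairs x_n with x_1
    return c1 / variance

grouped = [1] * 8 + [-1] * 8
alternating = [1, -1] * 8

# Enumerate every arrangement of eight +1 and eight -1 values.
all_r1 = []
for ones in combinations(range(16), 8):
    x = -np.ones(16)
    x[list(ones)] = 1.0
    all_r1.append(lag_one_autocorrelation(x))
all_r1 = np.array(all_r1)

print(lag_one_autocorrelation(grouped), lag_one_autocorrelation(alternating))
print(len(all_r1), all_r1.max(), int(np.sum(np.isclose(all_r1, all_r1.max()))))
```

The enumeration reproduces the count of 12 870 arrangements, of which 16 attain the maximum value 0.75.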


Now, the effect of change is illustrated in Figure 3.3, where we plot a time series generated 
from the normal distribution without time dependence. We now assume that the process is 
affected by a mechanism producing change, namely shifts up and down, at random points in time. 
As illustrated in Figure 3.3 and detailed in the figure caption, in this case patterns are produced 
and (positive) autocorrelation is induced. 

Had such change been describable in deterministic terms, as a deterministic function of time, 
that is, had it been precisely predictable in terms of location of times where it occurs and in terms 
of magnitude of state shifts, we would speak about nonstationarity. But since, as we said, the 
points of change are random points in time, they resist a deterministic description and the entire 
process with the change producing mechanism is a stationary stochastic process with dependence. 
Unfortunately, this simple truth is not widely understood and therefore the inconsistent 
interpretations of change as nonstationarity abound in hydroclimatic literature. 
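The argument can be illustrated by simulation. In the sketch below, independent Gaussian noise is superimposed on a level that jumps to a new random value at random points in time; the mechanism is the same at all times, so the combined process is stationary, yet it exhibits clear positive autocorrelation. The jump probability (1/50 per step), the N(0, 1) distribution of levels and the seed are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)    # arbitrary seed, for reproducibility only
n = 20_000

def lag_one_autocorrelation(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(x[:-1] * x[1:]) / np.mean(x ** 2)

noise = rng.standard_normal(n)         # no time dependence

# Hypothetical change mechanism: at random times (on average every 50 steps)
# the local level jumps to a new random value drawn from N(0, 1).
change_points = rng.random(n) < 1 / 50
change_points[0] = True
segment = np.cumsum(change_points) - 1
levels = rng.standard_normal(segment.max() + 1)[segment]

shifted = noise + levels

r1_noise = lag_one_autocorrelation(noise)
r1_shifted = lag_one_autocorrelation(shifted)
print(r1_noise, r1_shifted)
```

Under these assumptions the noise alone gives an estimate near zero, while the series with random shifts gives a markedly positive one.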
Figure 3.3 Illustration of the fact that change causes autocorrelation using a time series of length 20,
generated from the normal distribution N(0,1) without time dependence; the estimates of the statistical
characteristics from the time series, plotted as full points connected with continuous lines, are $\hat\mu = -0.05$,
$\hat\gamma_0 = 0.97$, $\hat r_1 = 0.05$. By shifting a time segment up (by +1, items 8-14) and another segment down
(by −1, items 15-20) we obtain a new time series (empty points connected with dashed lines) in which the
autocorrelation has become $\hat r_1 = 0.59$.


3.6 Asymptotic power laws and the log-log derivative 


It is quite common that nonnegative functions f(t) defined in [0, ∞), whose limits at 0 and
∞ exist, are associated with asymptotic power laws as t → 0 and ∞ (Koutsoyiannis, 2014b,
2017). Power laws are functions of the form


$$f(t) \propto t^b \qquad (3.37)$$


A power law is visualized on a graph of f(t) plotted against t with logarithmic axes, so
that the plot forms a straight line with slope b. Formally, the slope b is expressed by the
log-log derivative (LLD):

$$f^\#(t) := \frac{d(\ln f(t))}{d(\ln t)} = \frac{t\,f'(t)}{f(t)} \qquad (3.38)$$

If the power law holds for the entire domain, then $f^\#(t) = b$ = constant. In this case we
speak about a simple scaling behaviour. Most often, however, $f^\#(t)$ is not constant. Of
particular interest are the asymptotic values for t → 0 and ∞, symbolically $f^\#(0)$ and
$f^\#(\infty)$, which define two asymptotic power laws. We note that, if 0 < f(0) < ∞, then
$f^\#(0) = 0$, which means that f(0) has to be either 0 or ∞ in order for $f^\#(0) \ne 0$. Basic
properties of LLD are given in Table 3.4.
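A numerical approximation of the LLD is easy to set up by differencing in logarithmic coordinates. The sketch below is one possible implementation; the function name `log_log_derivative`, the relative step and the two test functions are illustrative assumptions. The second test function has different local and global slopes, showing the two asymptotic power laws.

```python
import numpy as np

def log_log_derivative(f, t, rel_step=1e-4):
    """Central-difference estimate of d(ln f)/d(ln t) at t > 0."""
    t = np.asarray(t, dtype=float)
    up, down = t * (1 + rel_step), t / (1 + rel_step)
    return (np.log(f(up)) - np.log(f(down))) / (np.log(up) - np.log(down))

power_law = lambda t: 3.0 * t ** 1.5          # simple scaling, slope 1.5 everywhere
two_regimes = lambda t: t ** 2 / (1 + t)      # hypothetical: ~t^2 near 0, ~t at infinity

slope_simple = log_log_derivative(power_law, np.array([0.01, 1.0, 100.0]))
slope_local = log_log_derivative(two_regimes, 1e-6)
slope_global = log_log_derivative(two_regimes, 1e6)
print(slope_simple, slope_local, slope_global)
```

For the second function the exact LLD is 2 − t/(1 + t), so the estimates approach 2 as t → 0 and 1 as t → ∞.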


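Table 3.4 Basic properties of LLD (from Koutsoyiannis, 2017).

| Description | Mathematical formula |
|---|---|
| Multiplication and addition by constants | $(\lambda f(t) + \mu)^\# = \dfrac{\lambda f(t)}{\lambda f(t) + \mu}\,f^\#(t)$ |
| Sum of two functions | $(f_1(t) + f_2(t))^\# = \dfrac{f_1(t)\,f_1^\#(t) + f_2(t)\,f_2^\#(t)}{f_1(t) + f_2(t)}$ |
| Product of two functions | $(f_1(t)\,f_2(t))^\# = f_1^\#(t) + f_2^\#(t)$ |
| Quotient of two functions | $(f_1(t)/f_2(t))^\# = f_1^\#(t) - f_2^\#(t)$ |
| Raise to a power | $(f(t)^\lambda)^\# = \lambda\,f^\#(t)$ |
| Function composition | $((f \circ g)(t))^\# = (f(g(t)))^\# = f^\#(g(t))\,g^\#(t)$ |
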


In particular, the asymptotic properties of the second order characteristics of a
stochastic process for t → 0, where now t denotes time, characterize the local behaviour
of a process, while those for t → ∞ characterize the global behaviour. We will discuss
these properties in section 3.8, after introducing the related concept of entropy 
production in section 3.7. 


3.7 Entropy production in stochastic processes 


In a stochastic process the change of uncertainty in time can be quantified by the entropy
production, i.e. the time derivative of the entropy Φ[X(t)] of the cumulative process X(t)
(Koutsoyiannis, 2011b):

$$\Phi'[X(t)] := \frac{d\Phi[X(t)]}{dt} \qquad (3.39)$$

A more convenient (and dimensionless) measure is the entropy production in logarithmic
time (EPLT):

$$\varphi(t) \equiv \varphi[X(t)] := \Phi'[X(t)]\,t = \frac{d\Phi[X(t)]}{d(\ln t)} \qquad (3.40)$$

For a Gaussian process, the entropy depends on its variance Γ(t) only (see Table 2.4) and
is given as:

$$\Phi[X(t)] = \frac{1}{2}\ln\!\left(2\pi e\,\beta^2\,\Gamma(t)\right) \qquad (3.41)$$

where β is the background measure density, assumed to be constant (Lebesgue). The
EPLT of a Gaussian process is thus easily shown to be:

$$\varphi(t) = \frac{\Gamma'(t)\,t}{2\,\Gamma(t)} = \frac{\Gamma^\#(t)}{2} = 1 + \frac{\gamma^\#(t)}{2} \qquad (3.42)$$

That is, EPLT is visualized and estimated by the slope of a log-log plot of the climacogram.
We note that if, because of using the cumulative process, the background measure was
taken as βt instead of β, the result would be practically the same (plus a constant 1).
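The steps leading from the Gaussian entropy to the EPLT can be spelled out as follows. The sketch only assumes that the background measure contributes an additive constant to the entropy, and uses the relation Γ(t) = t²γ(t) from (3.15) together with the product rule of the LLD.

```latex
\begin{align*}
\Phi[X(t)] &= \text{const} + \tfrac{1}{2}\ln\Gamma(t), \\
\varphi(t) &= \frac{d\Phi[X(t)]}{d(\ln t)}
            = \frac{1}{2}\,\frac{d\ln\Gamma(t)}{d\ln t}
            = \frac{\Gamma^\#(t)}{2}, \\
\Gamma(t) &= t^2\,\gamma(t)
  \;\Rightarrow\; \Gamma^\#(t) = 2 + \gamma^\#(t)
  \;\Rightarrow\; \varphi(t) = 1 + \frac{\gamma^\#(t)}{2}.
\end{align*}
```

If the background measure is proportional to t rather than constant, the entropy acquires an additional term ln t, whose derivative with respect to ln t is the constant 1 mentioned above.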

When the past and the present are observed, instead of the unconditional variance γ(t)
we should use a variance $\gamma_C(t)$ conditional on the known past and present. This can be
expressed in terms of the differenced climacogram (Koutsoyiannis, 2017):

$$\gamma_C(k) = \varepsilon\,\bigl(\gamma(k) - \gamma(2k)\bigr), \qquad \varepsilon = \frac{1}{1 - 2^{2H-2}} \qquad (3.43)$$


We can subsequently define the conditional entropy production in logarithmic time
(CEPLT) in a manner analogous to (3.42). By also considering the definition of the
climacospectrum in (3.21) and (3.22), CEPLT can be written as:

$$\varphi_C(k) = 1 + \frac{\gamma_C^\#(k)}{2} = \frac{1 + \psi^\#(k)}{2} = \frac{1 - \tilde\psi^\#(1/k)}{2} \qquad (3.44)$$

Thus, for a Gaussian process the conditional entropy production is given in terms of the log-log
slope of the process climacospectrum. We will use the same result as an
approximation for non-Gaussian processes too, even though in a non-Gaussian process
the entropy expression becomes more complicated than (3.41), with other terms
additional to variance.
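The chain of equalities for the CEPLT follows from the LLD rules for products and compositions. The sketch below assumes only that the differenced climacogram is a constant multiple of γ(k) − γ(2k), so that the climacospectrum is proportional to k times the conditional variance; the closing line is a hypothetical power-law example.

```latex
\begin{align*}
\psi(k) &= \frac{k\,\bigl(\gamma(k)-\gamma(2k)\bigr)}{\ln 2} \;\propto\; k\,\gamma_C(k)
  \;\Rightarrow\; \psi^\#(k) = 1 + \gamma_C^\#(k), \\
\varphi_C(k) &= 1 + \frac{\gamma_C^\#(k)}{2} = \frac{1 + \psi^\#(k)}{2}, \\
\tilde\psi(w) &= \psi(1/w)
  \;\Rightarrow\; \tilde\psi^\#(w) = -\psi^\#(1/w)
  \;\Rightarrow\; \varphi_C(k) = \frac{1 - \tilde\psi^\#(1/k)}{2}, \\
\gamma_C(k) &\propto k^{2b-2}
  \;\Rightarrow\; \psi^\#(k) = 2b - 1
  \;\Rightarrow\; \varphi_C(k) = b .
\end{align*}
```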


3.8 Asymptotic scaling of second order properties 


EPLT and the CEPLT are related to LLDs of second order tools such as climacogram, 
climacospectrum, power spectrum, etc. With a few exceptions, these slopes are nonzero 
asymptotically, hence entailing asymptotic scaling or asymptotic power laws with the 
LLDs being the scaling exponents. It is intuitive to expect that an emerging asymptotic 
scaling law would provide a good approximation of the true law for a range of scales. 

If the scaling law was appropriate for the entire range of scales, then we would have a 
simple scaling law. Such simple scaling sounds attractive from a mathematical point of 
view, but it turns out to be impossible in physical processes (Koutsoyiannis, 2017; 
Koutsoyiannis et al., 2018; see also below). It is thus physically more realistic to expect 
two different types of asymptotic scaling laws, one in each of the ends of the continuum 
of scales. The respective scaling exponents are given in terms of two parameters, M (to 
give credit to Mandelbrot) and H (to give credit to Hurst) according to the following 
relationships: 


• The parameter M characterizes the local scaling or smoothness or fractal behaviour,
when k → 0 or w → ∞:

$$M := \varphi_C(0) - 1 = \frac{\gamma_C^\#(0)}{2} = \frac{v^\#(0)}{2} = \frac{\psi^\#(0) - 1}{2} = \frac{-s^\#(\infty) - 1}{2} \qquad (3.45)$$

• The parameter H characterizes the global scaling or persistence or Hurst-
Kolmogorov behaviour, when k → ∞ or w → 0:

$$H := \varphi_C(\infty) = 1 + \frac{\gamma_C^\#(\infty)}{2} = 1 + \frac{\gamma^\#(\infty)}{2} = 1 + \frac{c^\#(\infty)}{2} = \frac{\psi^\#(\infty) + 1}{2} = \frac{1 - s^\#(0)}{2} \qquad (3.46)$$

These scaling behaviours have emerged from maximum entropy considerations, and 
this may provide the theoretical background in modelling complex natural processes by 
such scaling laws. Generally, scaling laws are a mathematical necessity and could be 
constructed for virtually any continuous function defined in [0, ∞). In other words, there
is no magic in power laws, except that they are, logically and mathematically, a necessity. 











3.9 Bounds of scaling 


Both parameters M and H take on values in the interval (0,1) (with the limiting cases M = 
1 and H = 0 being possible). This fact, combined with equations (3.45) and (3.46), defines 
limits of the possible scaling laws in natural processes. The limits are not quite well known 
and several studies have reported values out of the limits (see Digression 3.C for an 
example about how to avoid such a mistaken result). 

For the global behaviour, it has been shown (Koutsoyiannis et al., 2018) that a process
with $-s^\#(0) > 1$ is nonergodic. As already explained, inference from data is only possible
when the process is ergodic and thus, claiming that $-s^\#(0) > 1$ based on data is self-
contradictory. Steep slopes ($-s^\#(w) > 1$) are mathematically and physically possible for
medium and large w and indeed they are quite frequent in geophysical and other
processes. Because of the equality of slopes of power spectrum and climacospectrum, the
ergodicity limitation holds also for the slope of the climacospectrum, i.e., $\psi^\#(\infty) =
-\tilde\psi^\#(0) < 1$. On the other hand, too steep negative asymptotic slopes of the
climacospectrum are also impossible. Indeed (because of (3.44)), $\psi^\#(k) = -\tilde\psi^\#(1/k) <
-1$ would entail $\varphi_C(k) < 0$ and $\Gamma_C^\#(k) < 0$ (Koutsoyiannis, 2017). This means that the
variance of the cumulative process would be a decreasing function of time, which is
absurd. This holds both for the global case (k → ∞, in which the conditional variance $\Gamma_C(\infty)$
equals the unconditional $\Gamma(\infty)$) and the local case (k → 0, for the conditional variance
$\Gamma_C(0)$).

For the local behaviour, there is another severe limitation imposed by physical
reasoning. The case $\psi^\#(0) = -s^\#(\infty) < 1$ would entail infinite variance. Infinite variance
would require infinite energy to emerge, which is physically inconsistent (see also section
2.17). Therefore, the physical lower limit for $\psi^\#(0) = -s^\#(\infty)$ is 1. A final, and quite
severe, limitation is an upper bound of the local scaling exponent, which is 3 for $\psi^\#(0) =
-s^\#(\infty)$ (Koutsoyiannis, 2017). The problem if this limitation is violated is that the
resulting autocovariance function is not positive definite or, equivalently, that the
resulting power spectrum is not always (for any frequency w) positive but takes on
negative values for some w. Likewise, the Fourier transform of the climacogram takes on
negative values for some w. Proof is provided in Koutsoyiannis (2017).

The above limits define the “green square” of admissible values of $\varphi_C$, M and H in Figure
3.4, which is also depicted in terms of admissible values of slopes $\psi^\#$ and $s^\#$ (noting that
$s^\#$ can, by exception, take on values out of the square when $\varphi_C(0) = 2$ or $\varphi_C(\infty) = 0$). The
reasons why a process out of the square would be impossible or inconsistent, as discussed
above, are also marked in the figure.
Figure 3.4 Bounds of asymptotic values of CEPLT, $\varphi_C(0)$ and $\varphi_C(\infty)$, and corresponding bounds
of the log-log slopes of power spectrum and climacospectrum. The “green square” represents the
admissible region; note that $s^\#$ can, by exception, take on values out of the square when $\varphi_C(0) =
2$ (M = 1) or $\varphi_C(\infty) = 0$ (H = 0). The reasons why a process out of the square would be
impossible or inconsistent are also marked. The lines $\varphi_C(0) = 3/2$ (M = 1/2) and $\varphi_C(\infty) =
1/2$ (H = 1/2) define neutrality (which is represented by a Markov process) and support the
classification of stochastic processes into the indicated four categories (smaller squares within
the “green square”).


The centre of the square, with coordinates $\varphi_C(0) = 3/2$, $\varphi_C(\infty) = 1/2$, represents a
neutral process, whose typical representative is the Markov process (to be examined in
section 3.11). Larger values of $\varphi_C(0)$ (where M > 1/2) indicate a smooth process and
smaller ones (where M < 1/2) a rough process. Also, larger values of $\varphi_C(\infty)$ (where H >
1/2) indicate a persistent process and smaller ones (where H < 1/2) an antipersistent
process.
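The bounds and the four-way classification can be summarized in a few lines of code. The sketch below converts asymptotic climacospectrum slopes to M and H through (3.45) and (3.46) and then labels the process; the function names are hypothetical, and the admissible ranges (0 < M ≤ 1 and 0 ≤ H < 1) reflect our reading of the limiting cases stated in the text.

```python
def parameters_from_slopes(psi_slope_0, psi_slope_inf):
    """Convert climacospectrum log-log slopes at k -> 0 and k -> infinity to (M, H)."""
    M = (psi_slope_0 - 1) / 2      # equation (3.45)
    H = (psi_slope_inf + 1) / 2    # equation (3.46)
    return M, H

def classify_process(M, H):
    """Label local (smooth/rough) and global (persistent/antipersistent) behaviour."""
    if not (0 < M <= 1 and 0 <= H < 1):
        raise ValueError("outside the admissible region: need 0 < M <= 1 and 0 <= H < 1")
    local = "smooth" if M > 0.5 else "rough" if M < 0.5 else "neutral"
    global_ = "persistent" if H > 0.5 else "antipersistent" if H < 0.5 else "neutral"
    return local, global_

print(classify_process(*parameters_from_slopes(2.0, 0.0)))   # centre of the square
print(classify_process(0.9, 0.8))
```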
A useful observation in Figure 3.4 is that the entire “green square” lies below the equality
line, which means that the same scaling exponent is not possible for both local and global
behaviour, or else, it is impossible to have a physically realistic simple scaling process.
There is one exception, the upper-left corner of the “green square”, which corresponds to
the so-called “pink noise” or “1/f noise” and will be discussed further in Digression 3.G.

On the left of the “green square” in Figure 3.4 another square is formed, which 
represents processes that are mathematically feasible but physically unrealistic, because 
they entail infinite variance. In particular, the centre of this square represents the white 
noise, characterized by independence in time, which is discussed in section 3.10. One of 
the diagonals of this square represents the Hurst-Kolmogorov process, discussed in 
section 3.12. 


Digression 3.C: Misuses of stationarity and ergodicity (2) 


Continuing the examples on misuse of the concepts of stationarity and ergodicity in Digression 
3.A, we refer here to another example, whose standard formulation could be: “From the time
series $x_\tau$, we calculated the power spectrum and found that its slope for low frequencies is steeper
than −1, which means that the process is nonstationary.” We note that a large number of studies
exploring several data sets have reported steep constant slopes of power spectrum, i.e. β < −1,
which are thought to confirm the nonstationarity of the process. The fact is, however, that this
entire line of thought is theoretically inconsistent and such reported numerical results are
artefacts due to insufficient data or inadequate estimation algorithms. Once we express the power
spectrum of a process as a function of frequency, we have tacitly assumed a stationary process. In
a nonstationary process, both the autocovariance and the spectral density, i.e. the Fourier 
transform of the autocovariance, are functions of two variables, one being related to “absolute” 
time (see e.g. Dechant and Lutz, 2015). Thus, there is no meaning in using a stationary 
representation (setting the power spectrum as a function of frequency only) and, at the same time, 
claiming nonstationarity. Furthermore, once we use the power spectrum of a process for 
inference, as we always do, we should be aware that inference from data is only possible when 
the process is ergodic. As shown in Koutsoyiannis et al. (2018), in an ergodic process, the 
asymptotic slope on the left tail of the power spectrum cannot be steeper than —1. Thus, there is 
no meaning in reporting slopes in empirical power spectra < —1 and at the same time making 
any claim about the process properties (e.g. of nonstationarity) based on the power spectrum. 
Actually, such a steep slope, when emerging from processing of data, does not suggest that a 
process is non-ergodic; it rather signifies inconsistent estimation. Nonetheless, we should be
aware that steep slopes (< −1) are mathematically and physically possible for medium and large
frequencies; actually they are quite frequent in geophysical processes.

Consequently, possible reformulations of the above inconsistent statement could be the 
following: 


• We cursorily interpreted a slope steeper than −1 in the power spectrum as evidence of
nonstationarity, while a simple explanation would be that the frequencies on which our data
enable calculation of the power spectrum values are too high.

• We cursorily applied the concept of the power spectrum of a stationary stochastic process,
forgetting that the empirical power spectrum of a stationary stochastic process is a
(nonstationary) stochastic process per se (see section 4.10). The high variability of the latter
(or the inconsistent numerical algorithm we used) resulted in a slope for low frequencies
steeper than −1, which is absurd. Such a slope would suggest a non-ergodic process while our
calculations were based on the hypothesis of a stationary and ergodic process.

• We cursorily applied the concept of the power spectrum of a stationary stochastic process
using a time series which is a realization of a nonstationary stochastic process and we found an
inconsistent result; therefore, we will repeat the calculations recognizing that the power
spectrum of a nonstationary stochastic process is a function of two variables, frequency and
“absolute” time.


3.10 White noise: how natural and how white is it? 


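We are all familiar with the notion of independent events at discrete time, such as coin,
dice and roulette wheel experiments. If such an experiment is performed sequentially in
time, we can model it as a stochastic process $v'_\tau$, τ = 1, 2, ..., with mean μ and variance σ².
For convenience we subtract its mean, defining the process $v_\tau := v'_\tau - \mu$ for which:

$$\mathrm{E}[v_\tau] = 0, \qquad \mathrm{var}[v_\tau] = \mathrm{E}[v_\tau^2] = \sigma^2, \qquad c_\eta := \mathrm{cov}[v_\tau, v_{\tau+\eta}] = \begin{cases}\sigma^2, & \eta = 0\\ 0, & \eta \ne 0\end{cases} \qquad (3.47)$$

It is easy to show that the time-averaged process:

$$v_\tau^{(\kappa)} := \frac{1}{\kappa}\sum_{i=(\tau-1)\kappa+1}^{\tau\kappa} v_i \qquad (3.48)$$

has the following properties:

$$\mathrm{E}\bigl[v_\tau^{(\kappa)}\bigr] = 0, \qquad \gamma_\kappa := \mathrm{var}\bigl[v_\tau^{(\kappa)}\bigr] = \frac{\sigma^2}{\kappa}, \qquad c_\eta^{(\kappa)} := \mathrm{cov}\bigl[v_\tau^{(\kappa)}, v_{\tau+\eta}^{(\kappa)}\bigr] = \begin{cases}\sigma^2/\kappa, & \eta = 0\\ 0, & \eta \ne 0\end{cases} \qquad (3.49)$$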
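The inverse proportionality of the variance of the averaged process to the time scale is easy to verify by simulation. The sketch below uses a two-valued independent sequence with standard deviation σ = 2; the sample size, the scales, the seed and the function name `averaged_variance` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # arbitrary seed
sigma = 2.0
v = sigma * rng.choice([-1.0, 1.0], size=400_000)   # zero mean, variance sigma**2

def averaged_variance(series, kappa):
    """Sample variance of the averages over non-overlapping blocks of kappa values."""
    n_blocks = len(series) // kappa
    blocks = series[: n_blocks * kappa].reshape(n_blocks, kappa)
    return blocks.mean(axis=1).var()

scales = (1, 4, 16)
estimates = {kappa: averaged_variance(v, kappa) for kappa in scales}
theory = {kappa: sigma ** 2 / kappa for kappa in scales}
for kappa in scales:
    print(kappa, estimates[kappa], theory[kappa])
```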


Is it legitimate to say that the discrete-time process $v_\tau$ originates from a continuous-time
process v(t)? And if yes, what are the properties of the latter? The mathematical
answer to the former question is positive. To materialize the continuous-time variant it
suffices to generalize the climacogram in (3.49), changing the time scale from an integer κ
to a real number k := κD:

$$\gamma(k) = \frac{\sigma^2 D}{k} \qquad (3.50)$$

It is easily seen that if k → 0, the process variance tends to infinity. Thus, to express the
properties of the continuous-time process, we need to involve the Dirac delta function
δ(t), whose properties are:

$$\delta(t) = \begin{cases}\infty, & t = 0\\ 0, & t \ne 0\end{cases}, \qquad \int_a^b \delta(t)\,dt = 1 \qquad (3.51)$$

where [a, b] is any interval that contains 0. To connect the discrete-time process $v_\tau$ to
the continuous-time process v(t), we assume that the former is the time-average of the
latter on the time interval of length D, as in equation (3.13). If we define v(t) as a
stationary stochastic process which has the following properties:

$$\mathrm{E}[v(t)] = 0, \qquad \mathrm{cov}[v(t), v(t')] = \mathrm{E}[v(t)\,v(t')] = \sigma^2 D\,\delta(t - t') \qquad (3.52)$$
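then it results in a discrete-time process with the properties of equation (3.49). Indeed,
the variance of $v_\tau$ will be:

$$\mathrm{var}[v_\tau] = \mathrm{var}[v_1] = \mathrm{E}\!\left[\left(\frac{1}{D}\int_0^D v(t)\,dt\right)^{2}\right] = \frac{1}{D^2}\,\mathrm{E}\!\left[\int_0^D v(t)\,dt\int_0^D v(s)\,ds\right] = \frac{1}{D^2}\int_0^D\!\!\int_0^D \mathrm{E}[v(t)\,v(s)]\,dt\,ds = \frac{\sigma^2 D}{D^2}\int_0^D\!\!\int_0^D \delta(t - s)\,dt\,ds = \frac{\sigma^2 D}{D^2}\int_0^D ds = \sigma^2 \qquad (3.53)$$
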
The power spectrum of process v(t) is found (from equation (3.20)) to be constant:

$$s(w) = \sigma^2 D \qquad (3.54)$$

Because all frequencies w are present in the power spectrum with equal density, the
process v(t) is called white noise. This name has been given by analogy to white light,
which is a mixture of all visible frequencies. We note though that this is a misnomer as the
power spectrum of the white light is far different from flat.

While mathematically the white noise is a well-founded concept and useful for many 
theoretical analyses, it may not be physically realistic for several reasons, such as the 
following: 

• Its variance is infinite: $\mathrm{var}[v(t)] = \mathrm{E}[v^2(t)] = \sigma^2 D\,\delta(0) = \infty$. If this
represented a natural process, this process would have infinite energy.

• Its autocorrelation for lags however small is zero. In a natural process, the
autocorrelation should be close to 1 for lags close to zero.

• Its spectral density is nonzero as frequency tends to infinity.


These problems are remedied by applying some kind of filtering to the process v(t). An
example is to set an upper limit $w_c$ to the frequency, beyond which the spectral density
becomes zero (a so-called low-pass or high-cut filter). The second-order characteristics of
the thus obtained stochastic process $\tilde v(t)$ are:

$$\tilde\gamma_0 = \sigma^2 D\,w_c, \qquad \tilde c(h) = \sigma^2 D\,w_c\,\mathrm{sinc}(2\pi w_c h), \qquad \tilde s(w) = \begin{cases}\sigma^2 D, & w \le w_c\\ 0, & w > w_c\end{cases} \qquad (3.55)$$

It may be readily seen that the above three inconsistencies have been remedied. On the
other hand, the process $\tilde v(t)$ does not yield precisely the process $v_\tau$ in discrete time.
However, if we choose $w_c \gg 1/D$, we can obtain good approximations.
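When evaluating the autocovariance of the filtered process numerically, attention is needed to the convention of the sinc function. In the expression above, sinc(x) = sin(x)/x, whereas NumPy's `np.sinc(u)` computes sin(πu)/(πu). The sketch below shows one way to reconcile the two; the function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def filtered_autocovariance(h, sigma2, D, w_c):
    """sigma^2 D w_c sinc(2 pi w_c h), with sinc(x) = sin(x)/x."""
    h = np.asarray(h, dtype=float)
    # np.sinc(u) = sin(pi u)/(pi u); with u = 2 w_c h this equals sinc(2 pi w_c h).
    return sigma2 * D * w_c * np.sinc(2.0 * w_c * h)

sigma2, D, w_c = 1.5, 1.0, 2.0           # hypothetical parameter values
c0 = filtered_autocovariance(0.0, sigma2, D, w_c)              # the finite variance
c_first_zero = filtered_autocovariance(1 / (2 * w_c), sigma2, D, w_c)
x = 2 * np.pi * w_c * 0.3
c_direct = sigma2 * D * w_c * np.sin(x) / x                    # direct evaluation at h = 0.3
```

The variance is now finite and the autocovariance varies continuously with the lag, vanishing first at h = 1/(2 w_c).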


Digression 3.D: Random walk, Wiener process and Brownian motion 

Assuming that the discrete-time white noise process $v_\tau$ is two-valued, e.g. taking on the values +1
and −1 with equal probabilities p = 0.5 (so that $\mathrm{E}[v_\tau] = 0$), the cumulative process $V_\tau := \sum_{i=1}^{\tau} v_i$,
which can take on values in the interval [−τ, τ], is called a random walk. This is a nonstationary
process with its variance being proportional (actually equal in this simple case) to the time τ that
has passed from the beginning of the walk, i.e. $\mathrm{var}[V_\tau] = \tau$. Its mean is zero at all times.

If both the time t and the state v(t) of the white noise are continuous, then the resulting
cumulative process $V(t) := \int_0^t v(s)\,ds$ is called the Wiener process. This is again a nonstationary
process with mean zero and variance proportional to the time t, i.e. $\mathrm{var}[V(t)] = \sigma^2 t$, where σ²
has been defined above; note that the quantity σ²/2 is known as the diffusion constant.

The Wiener process is used to model diffusion phenomena and the Brownian motion under 
free conditions, i.e., when there are no bounds in the motion, nor a restoring force (e.g. gravity in 
atmospheric motion). However, in real world systems the motion is not free (these conditions do 
not exist) and the Brownian motion is bound. In such systems the resulting process is not Wiener 
but a stationary process. More information on these processes can be found in Papoulis (1991). 
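The linear growth of the random-walk variance can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, the exact variance is computed from the binomial distribution of the number of +1 steps, and the simulation uses Python's standard pseudo-random generator.

```python
import math
import random

def random_walk(tau, rng):
    """Return the path V_1..V_tau of a +/-1 random walk with p = 0.5."""
    position, path = 0, []
    for _ in range(tau):
        position += rng.choice((-1, 1))
        path.append(position)
    return path

def exact_variance(tau):
    """Exact var[V_tau]: with k steps of +1 out of tau, V_tau = 2k - tau."""
    total = sum((2 * k - tau) ** 2 * math.comb(tau, k) for k in range(tau + 1))
    return total / 2 ** tau

def sample_variance_at(tau, n_walks, seed=0):
    """Second moment of V_tau over n_walks walks (the mean is known to be 0)."""
    rng = random.Random(seed)
    finals = [random_walk(tau, rng)[-1] for _ in range(n_walks)]
    return sum(v * v for v in finals) / n_walks

if __name__ == "__main__":
    for tau in (5, 25, 100):
        print(tau, exact_variance(tau), sample_variance_at(tau, 2000))
```

The exact variance equals τ for every τ, while the sample estimate fluctuates around it.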


3.11 The linear Markov process 


We will now discuss a more interesting case of filtering of the white noise by means of a 
stochastic version of a linear differential equation. To establish such an equation, we use 
a simple hydrological system, a linear reservoir with inflow v(t) and outflow x(t). The
reservoir state is characterised by its storage S(t) and the change in outflow (reservoir
spill) is assumed (as an approximation) to be proportional to the change in storage, dx =
dS/α, where α > 0 is a constant with units of time. The continuity equation is dS/dt = v −
x and if we make the substitution dS = α dx we find that the system dynamics is the first-order
linear differential equation (for a nonlinear version see Digression 9.A):

$$\alpha\,\frac{dx(t)}{dt} + x(t) = v(t) \qquad (3.56)$$

Now, let us assume that the inflow is a stochastic process and specifically a white noise
process. For convenience we subtract its mean so that v(t) has the characteristics given
in equation (3.52). The output x(t) will be a stochastic process as well. Thus, we can write
the stochastic version of equation (3.56) as:

$$\alpha\,\frac{dx(t)}{dt} + x(t) = v(t) \qquad (3.57)$$

As simple as it may seem, the transition from the deterministic version in equation (3.56)
to the stochastic version in equation (3.57) involves mathematical troubles. In fact, the
process x(t) is hardly differentiable and the derivative dx(t)/dt does not generally exist.


Thus, stochastic differential equations require their own rules of calculus. Here we use 
the following simple rule: We solve the differential equation as if it were deterministic 
with well-defined derivative. Naturally, the mathematical expression of the solution will 
not contain derivatives. In that expression we replace the deterministic functions with 
stochastic processes. Thus, the differentiability problem is bypassed. 

In this manner, the linear differential equation (3.57) is easily solved to give: 


$$x(t) = x(0)\,e^{-t/\alpha} + \frac{e^{-t/\alpha}}{\alpha}\int_0^t v(u)\,e^{u/\alpha}\,du \qquad (3.58)$$
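For completeness, the deterministic solution step can be sketched with the standard integrating-factor method, with α denoting the time constant of the reservoir. This is a routine derivation added for illustration rather than a step spelled out in the text:

```latex
\begin{align*}
\alpha\,\frac{dx(t)}{dt} + x(t) = v(t)
  &\;\Longrightarrow\; \frac{d}{dt}\left( x(t)\,e^{t/\alpha} \right)
     = \frac{1}{\alpha}\,v(t)\,e^{t/\alpha} \\
  &\;\Longrightarrow\; x(t)\,e^{t/\alpha} - x(0)
     = \frac{1}{\alpha}\int_0^t v(u)\,e^{u/\alpha}\,du \\
  &\;\Longrightarrow\; x(t) = x(0)\,e^{-t/\alpha}
     + \frac{e^{-t/\alpha}}{\alpha}\int_0^t v(u)\,e^{u/\alpha}\,du
\end{align*}
```

The final expression contains no derivatives, which is what allows the deterministic functions to be replaced by stochastic processes.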


We observe in equation (3.58) that:
1. The two additive terms on the right-hand side are independent as the outflow of
the present, x(0), cannot depend on the future inflows v(u), 0 < u < t.

2. The outflow does depend on the outflow of the present, x(0), but not on other x(t) 
of the past (t < 0). 


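A stochastic process that has the latter property is called a Markov process. More
generally, a Markov process is one in which the future does not depend on the past once
the present is known; symbolically (underlined symbols denote stochastic variables and
plain symbols their values):

$$P\{\underline{x}(t) \mid \underline{x}(s) = x(s),\ s \le 0 < t\} = P\{\underline{x}(t) \mid \underline{x}(0) = x(0)\} \qquad (3.59)$$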


The particular Markov process x(t) of equation (3.58) can be called the linear Markov 
process and it is also known as Ornstein-Uhlenbeck process, while the stochastic 
differential equation (3.57) is known as the Langevin equation (Papoulis, 1991). The mean 
of the process is: 


$$E[x(t)] = E[x(0)]\,e^{-t/\alpha} \qquad (3.60)$$


Subtracting equation (3.60) from (3.58), squaring and taking expected values we get: 
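$$\begin{aligned}
\operatorname{var}[x(t)] &= \operatorname{var}[x(0)]\,e^{-2t/\alpha} + \frac{\sigma^2 D}{\alpha^2}\,e^{-2t/\alpha}\int_0^t e^{2u/\alpha}\,du \\
&= \frac{\sigma^2 D}{2\alpha} + \left(\operatorname{var}[x(0)] - \frac{\sigma^2 D}{2\alpha}\right)e^{-2t/\alpha}
\end{aligned} \qquad (3.61)$$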


From (3.60) and (3.61) we conclude that E[x(t)] and var[x(t)] tend fast (exponentially)
to 0 and λ := σ²D/(2α), respectively, regardless of the values E[x(0)] and var[x(0)]. In
particular, if E[x(0)] = 0 and var[x(0)] = γ₀ = λ, then the process has constant mean (0)
and variance (λ) at all times.

It is easily seen that the following equation is a consequence of (3.58): 
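$$x(t+h) = x(t)\,e^{-h/\alpha} + \frac{e^{-(t+h)/\alpha}}{\alpha}\int_t^{t+h} v(u)\,e^{u/\alpha}\,du \qquad (3.62)$$

Multiplying this equation by x(t) and taking expected values (the integral term, which
involves only inflows later than t, is independent of x(t) and has zero mean) we get:

$$c(t,h) = E[x(t+h)\,x(t)] = E[x(t)^2]\,e^{-h/\alpha} \qquad (3.63)$$

and in the case {E[x(0)] = 0, var[x(0)] = λ} this becomes:

$$c(h) = \lambda\,e^{-h/\alpha} \qquad (3.64)$$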


In other words, the autocovariance is a function of the lag h only and the process is wide- 
sense stationary. The other second-order characteristics of the process in continuous and 
discrete time, derived through the generic equations contained in Table 3.3, are 
summarized in Table 3.5. 

The celebrated linear Markov process is nothing more than filtered white noise 
through a linear differential equation. The filtering eliminates the problems related to the
appearance of infinities discussed in section 3.10 and, thus, it is physically consistent.
Furthermore, the simplicity of the equations of its second-order properties makes it
attractive and easy to use. On the other hand, its Markovian property, i.e. the
independence of the future from the past, once the present is known, may contradict our
perception that history does always influence the future developments. We may thus
regard it as too simplistic a model of natural reality. Furthermore, the fact that it
minimizes entropy production for large times (t → ∞) (Koutsoyiannis, 2011b; see also
Digression 3.G) may be another obstacle in accepting it as a good model to represent 
natural processes. 


Table 3.5 Second-order characteristics of the Markov process at continuous and discrete time.

Property                                       Formula                                                            Eqn. no.
Variance
  Continuous-time process (instantaneous)      γ₀ = γ(0) = c(0) = λ                                               (3.65)
  Averaged process at scale k (climacogram)    γ(k) = [2λ/(k/α)] [1 − (1 − e^(−k/α))/(k/α)]                       (3.66)
Autocovariance function
  Continuous time, lag h                       c(h) = λ e^(−|h|/α)                                                (3.64)
  Discrete time, lag η = h/D                   c_0 = γ(D), c_η = λ [(1 − e^(−D/α))/(D/α)]² e^(−(η−1)D/α), η ≥ 1   (3.67)
Power spectrum
  Continuous time, frequency w                 s(w) = 4αλ/[1 + (2παw)²]                                           (3.68)
  Discrete time, frequency ω = wD

Figure 3.5 Second-order characteristics of a linear Markov process with parameters λ = 1, α =
20 and discretization time step D = 1. The climacogram and climacospectrum are precisely the
same for the continuous- and discrete-time representations. The autocovariance and the power
spectrum have some differences between the two representations, which are invisible in the
former case and visible in the latter.

A discretized Markov process at time step D tends to be uncorrelated in time as D
increases. Therefore, at large time scales the Markov model is indistinguishable from
white noise: indeed, from equation (3.66) we conclude that for large k (or small α) the
variance is inversely proportional to the time scale, as in the white noise. Thus, even
though sometimes it is said that the Markov model reflects short-term persistence, it is
better not to use the term persistence in this case. Certainly, it entails short-range
dependence in time. However, its asymptotic properties (cf. equations (3.45) and (3.46))
are (Koutsoyiannis, 2011b):


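$$\begin{aligned}
&M = \tfrac{1}{2}, \quad \varphi_C(0) = \tfrac{3}{2}, \quad \gamma^{\#}(0) = c^{\#}(0) = 0, \quad \psi^{\#}(0) = 2, \quad s^{\#}(\infty) = -2 \\
&H = \tfrac{1}{2}, \quad \varphi_C(\infty) = \tfrac{1}{2}, \quad \gamma^{\#}(\infty) = -1, \quad c^{\#}(\infty) = -\infty, \quad \psi^{\#}(\infty) = s^{\#}(0) = 0
\end{aligned} \qquad (3.70)$$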
Thus, according to the classification of section 3.9, the process is neutral: neither 
antipersistent nor persistent and neither rough nor smooth. 
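The white-noise-like behaviour of the Markov climacogram at large scales can be verified numerically. The sketch below evaluates equation (3.66) and, as an independent check, the general relationship γ(k) = (2/k²) ∫₀ᵏ (k − h) c(h) dh between the climacogram and the autocovariance of a stationary process, integrated here by the midpoint rule. Function names and the parameter values λ = 1, α = 20 are illustrative.

```python
import math

def markov_autocov(h, lam, alpha):
    """c(h) = lam * exp(-|h|/alpha), equation (3.64)."""
    return lam * math.exp(-abs(h) / alpha)

def markov_climacogram(k, lam, alpha):
    """gamma(k) of equation (3.66); expm1 avoids cancellation for small k."""
    r = k / alpha
    return (2.0 * lam / r) * (1.0 + math.expm1(-r) / r)

def climacogram_from_autocov(k, lam, alpha, n=20000):
    """(2/k^2) * integral over 0..k of (k - h) c(h) dh, by the midpoint rule."""
    dh = k / n
    total = 0.0
    for i in range(n):
        h = (i + 0.5) * dh
        total += (k - h) * markov_autocov(h, lam, alpha)
    return 2.0 * total * dh / k ** 2

if __name__ == "__main__":
    lam, alpha = 1.0, 20.0
    for k in (1.0, 20.0, 500.0, 2.0e5):
        g = markov_climacogram(k, lam, alpha)
        # The last column tends to 1 when gamma(k) behaves as 2*lam*alpha/k.
        print(k, g, g * k / (2.0 * lam * alpha))
```

For k much larger than α the product k·γ(k) approaches the constant 2λα, i.e. the variance is inversely proportional to the time scale.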

While the linear differential equation, on which the introduction of the Markov model 
has been based, has some physical basis, the assumption that the inflow is white noise is 
physically problematic, as we clarified in section 3.10. This is another reason making the 
simple Markov model inappropriate for natural systems. This problem, even though 
rarely noticed, is also met in most of the cases of stochastic differential equations, which 
are deterministic equations perturbed by white noise. 

Related to the Markov process in continuous time is the discrete-time process: 


$$x_\tau = a\,x_{\tau-1} + v_\tau + b\,v_{\tau-1} \qquad (3.71)$$

commonly known as ARMA(1,1), which stands for autoregressive moving-average
process of orders (1,1). Here v_τ is discrete-time white noise with variance σ_v², and a and b
are parameters. It can be easily shown (homework) that its second order characteristics
are interrelated by:

$$c_0 = a\,c_1 + (1 + ab + b^2)\,\sigma_v^2, \qquad c_1 = a\,c_0 + b\,\sigma_v^2, \qquad c_\eta = a^{\eta-1} c_1,\ \ \eta \ge 1 \qquad (3.72)$$


By comparison with equation (3.67) we see that the ARMA(1,1) process is identical to the 
discrete-time representation of the Markov process if we choose: 


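$$a = e^{-D/\alpha}, \qquad c_0 = \gamma_1 = \frac{2\lambda}{D/\alpha}\left(1 - \frac{1 - e^{-D/\alpha}}{D/\alpha}\right), \qquad c_1 = \frac{\lambda\,(1 - e^{-D/\alpha})^2}{(D/\alpha)^2} \qquad (3.73)$$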


Alternatively, if we know the first three terms of the autocovariance function in discrete 
time, then, without referring to the continuous time formulation, the parameter a can be 
found as the ratio 


$$a = c_2/c_1 \qquad (3.74)$$


The remaining parameters b and σ_v² can be found from the first two equations in (3.72) in
terms of c_0 = γ_1 and c_1.
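The parameter recovery just described can be sketched in Python. The code is illustrative, not the author's implementation: it takes a = c₂/c₁, then solves the first two relations of (3.72) for b and σ_v², which leads to a quadratic in b whose root with |b| ≤ 1 is kept (an assumption made here so that the solution is unique). The last function evaluates the averaged Markov quantities of equation (3.73). All function names are hypothetical.

```python
import math

def arma11_autocov(a, b, var_v, n_lags):
    """Return [c_0, c_1, ..., c_n_lags] implied by equation (3.72)."""
    c0 = (1.0 + 2.0 * a * b + b * b) * var_v / (1.0 - a * a)
    c1 = a * c0 + b * var_v
    return [c0] + [c1 * a ** (eta - 1) for eta in range(1, n_lags + 1)]

def fit_arma11(c0, c1, c2):
    """Recover (a, b, var_v) from the first three autocovariances."""
    a = c2 / c1                       # equation (3.74)
    q = c1 - a * c0                   # equals b * var_v
    p = c0 - a * c1 - a * q           # equals (1 + b^2) * var_v
    if abs(q) < 1e-12 * abs(c0):
        return a, 0.0, p              # AR(1) special case, b = 0
    disc = max(p * p - 4.0 * q * q, 0.0)
    b = (p - math.sqrt(disc)) / (2.0 * q)   # root of q*b^2 - p*b + q = 0 with |b| <= 1
    return a, b, p / (1.0 + b * b)

def markov_to_arma11_autocov(lam, alpha, D):
    """Return (a, c_0, c_1) of the averaged Markov process, equation (3.73)."""
    r = D / alpha
    a = math.exp(-r)
    c0 = (2.0 * lam / r) * (1.0 - (1.0 - a) / r)
    c1 = lam * (1.0 - a) ** 2 / r ** 2
    return a, c0, c1

if __name__ == "__main__":
    a, c0, c1 = markov_to_arma11_autocov(1.0, 20.0, 1.0)
    print(fit_arma11(c0, c1, a * c1))
```

In the usage example the third autocovariance is formed as c₂ = a·c₁, in line with the last relation of (3.72).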
The special case in which:

$$b = 0 \iff c_1/c_0 = a \qquad (3.75)$$


is known as the AR(1) process, standing for autoregressive process of order 1. This is the 
limiting case as D/α → 0. It can also appear in a discrete-time representation of the
Markov process for finite time step D, if we use instantaneous quantities, rather than time 
averages—the so-called sampled process, defined in discrete time as: 


$$x_\tau = x(\tau D) \qquad (3.76)$$

(compare this with (3.13)). The AR(1) process is thus:

$$x_\tau = a\,x_{\tau-1} + v_\tau \qquad (3.77)$$

and its second order characteristics are interrelated by:

$$c_0(1 - a^2) = \sigma_v^2, \qquad c_\eta = a^{|\eta|} c_0 \qquad (3.78)$$

Additional information about discrete time processes of this type is given in Digression
3.E.
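As an illustration of the sampled process, the following Python sketch simulates an AR(1) series with a = exp(−D/α) and σ_v² = λ(1 − a²), as implied by equation (3.78) with c₀ = λ, and compares sample autocovariances with λ·a^η. Gaussian noise is an assumption made only for this example (the second-order properties do not require it), and the function names are hypothetical.

```python
import math
import random

def simulate_sampled_markov(n, lam, alpha, D, seed=0):
    """Simulate n terms of x_tau = a*x_(tau-1) + v_tau with a = exp(-D/alpha)."""
    rng = random.Random(seed)
    a = math.exp(-D / alpha)
    sd_v = math.sqrt(lam * (1.0 - a * a))   # from c_0 * (1 - a^2) = var_v
    x = rng.gauss(0.0, math.sqrt(lam))      # start in the stationary state
    series = []
    for _ in range(n):
        series.append(x)
        x = a * x + rng.gauss(0.0, sd_v)
    return series

def sample_autocov(series, lag):
    """Autocovariance estimate about the known zero mean."""
    n = len(series) - lag
    return sum(series[i] * series[i + lag] for i in range(n)) / n

if __name__ == "__main__":
    lam, alpha, D = 1.0, 5.0, 1.0
    xs = simulate_sampled_markov(100000, lam, alpha, D)
    for eta in (0, 1, 5):
        print(eta, sample_autocov(xs, eta), lam * math.exp(-eta * D / alpha))
```

The estimates scatter around the theoretical values; the scatter grows as D/α decreases because the series then becomes more strongly correlated.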


Digression 3.E: The Time Series School and its processes 


The AR(1) and ARMA(1,1) processes discussed in section 3.11 are representatives of bigger 
families of models developed within the Time Series School. It is worth mentioning one more 
process from these families, the AR(2) process, which is: 


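$$x_\tau = a_1 x_{\tau-1} + a_2 x_{\tau-2} + v_\tau \qquad (3.79)$$

It can be easily shown (homework) that its second order characteristics are interrelated by:

$$c_0 = a_1 c_1 + a_2 c_2 + \sigma_v^2, \qquad c_1 = a_1 c_0 + a_2 c_1, \qquad c_\eta = a_1 c_{\eta-1} + a_2 c_{\eta-2},\ \ \eta \ge 1 \qquad (3.80)$$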


Once the covariances c_0, c_1, c_2 are known (estimated from data or derived theoretically) the three
parameters a_1, a_2, σ_v² can be easily found as the system of equations is linear. These equations are
called Yule–Walker equations as they were introduced by Yule (1927) and Walker (1931), both
British statisticians who, starting from an analysis of sunspot numbers, studied autoregressive 
processes and in particular their periodogram and autocorrelation properties. 
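A minimal Python sketch of the AR(2) case follows. It solves the two linear equations c₁ = a₁c₀ + a₂c₁ and c₂ = a₁c₁ + a₂c₀ (the latter being the recursion of (3.80) at η = 2) by Cramer's rule and then obtains σ_v² from the first relation of (3.80). The helper that generates test covariances and all function names are illustrative.

```python
def ar2_autocov(a1, a2, var_v):
    """Return (c_0, c_1, c_2) implied by equation (3.80) for a stationary AR(2)."""
    rho1 = a1 / (1.0 - a2)            # from c_1 = a1*c_0 + a2*c_1
    rho2 = a1 * rho1 + a2             # from c_2 = a1*c_1 + a2*c_0
    c0 = var_v / (1.0 - a1 * rho1 - a2 * rho2)
    return c0, rho1 * c0, rho2 * c0

def yule_walker_ar2(c0, c1, c2):
    """Solve the Yule-Walker equations for (a1, a2, var_v)."""
    det = c0 * c0 - c1 * c1
    a1 = c1 * (c0 - c2) / det
    a2 = (c0 * c2 - c1 * c1) / det
    var_v = c0 - a1 * c1 - a2 * c2
    return a1, a2, var_v

if __name__ == "__main__":
    covariances = ar2_autocov(0.5, 0.3, 1.0)
    print(covariances)
    print(yule_walker_ar2(*covariances))
```

The round trip from parameters to covariances and back recovers the original values up to rounding error.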

Obviously, higher order AR and ARMA models can be formulated, and actually are in common 
use, along with additional families such as ARIMA(p,d,q) (standing for autoregressive integrated 
moving average models) and ARFIMA(p,d,q) (with the additional ‘F’ standing for fractional). 
However, we will not refer to them, preferring to base our analyses on the Stochastic School, 
pioneered by A. Kolmogorov, which offers more solid grounds, both for foundation and 
application, than the Time Series School. 

We should note, however, that the latter School and its models are way more popular than the 
former in many disciplines, including hydrology and climatology. It appears that the Time Series 
School was initiated by the American economist W.M. Persons. In studying the problem “When to
buy or sell”, Persons (1919) introduced the study of time series, which he called statistical series,
and asserted that they “result from the combination of four elements: secular trend, seasonal 
variation, cyclical fluctuation, and a residual factor.” He also proposed methods for “Eliminating 
secular trends” and “Eliminating seasonal variation”. Interestingly, the Ukrainian/Russian/Soviet 
mathematical statistician and economist Slutsky (1927) demonstrated that what Persons (and 
other economists) regarded as cyclical component is nothing but a statistical artefact with no 
essential meaning (see e.g. Kyun and Kim 2006; Barnett, 2006). Subsequently, the notion of a 
cyclical component was abandoned but the decomposition of a time series to the remaining three 
components, trends, seasonal variation and residuals is popular even today. 

Perhaps the first definition of a time series was given by the American statistician Bailey 
(1929):

A time series is a series of observations taken at different times and recorded with the time at
which they were taken.


The biggest progress in the Time Series School was made in Uppsala by the Norwegian-born 
(with career in Sweden) econometrician and statistician H.O.A. Wold and the New-Zealand-born 
mathematician and statistician P. Whittle, who in their doctoral theses provided the stochastic 
foundation of time series analysis. Wold (1938, 1948) proved that a stochastic process (even 
though he referred to it as a time series) can be decomposed into a regular process (i.e., a process 
linearly equivalent to a white noise process) and a predictable process (i.e., a process that can be 
expressed in terms of its past values). This has been known as Wold’s decomposition. Whittle 
(1951, 1952, 1953) laid the mathematical foundation of autoregressive and moving average 
models in univariate and multivariate setting. Later, in their influential book, Box and Jenkins 
(1970) named these models with the above acronyms and they became popular with these names 
and also with the name Box - Jenkins models (cf. Stigler’s law of eponymy; Stigler, 2002). 

Despite the wider influence of the Time Series School over the Stochastic School, there are 
several problems with the former. First, the term time series is ambiguous, sometimes denoting a 
series of observations as in the original definition of Bailey (1929) (or, equivalently, a realization 
of a stochastic process), and other times denoting the stochastic process per se (as in the 
aforementioned use by Wold). As we have already emphasized, here the term time series is used 
with the first meaning, a series of numbers, while for a series of stochastic variables we use the 
term stochastic process. Second, with the exception of the simplest models of these families, such 
as the AR(1) and ARMA(1,1), time series models are too artificial because, being complicated 
discrete-time models, they do not necessarily correspond to a continuous time process, while 
natural processes typically evolve in continuous time. Furthermore, their identification, typically 
based on the estimation of the autocorrelation function from data, usually neglects estimation 
bias and uncertainty, which in stochastic processes (as opposed to purely random processes) are
often tremendous (Lombardo et al., 2014). 

Indeed, from their onset (Whittle, 1952), time series models have been tightly associated with 
a large number of parameters and they usually become over-parameterized and thus not 
parsimonious. These parameters are estimated from data, which usually are too few to support a 
reliable estimation. The decomposition of a time series to components, trends, seasonal variation 
and residuals, is fundamentally problematic, despite being popular. Remarkably, a meaningful 
definition of a trend has never been given. Also, it may be hard to conceive how time per se could 
be regarded as an explanatory variable in a complex process and what the logical basis is in 
expressing the statistics of a physical process as a deterministic function of time. Accumulation of 
data series with long time spans (cf. Chapter 1) has shown that, what have been regarded as 
trends, are mostly parts of long term fluctuations (and in accordance with Slutsky’s work, they could also
be regarded as statistical artefacts). Finally, “deseasonalization” (in Persons’s original
terminology, “Eliminating seasonal variation”) is a delusion; we can hardly remove seasonality in
the multivariate distribution of a stochastic process; what we typically do is in the marginal
distribution, and thus there is no elimination.


3.12 The Hurst-Kolmogorov process 


The Hurst-Kolmogorov (HK) process has been already introduced in section 1.3 and its 
discrete-time version was given in equation (1.6). Its continuous-time version is quite 
similar: 


$$\gamma(k) = \lambda\left(\frac{\alpha}{k}\right)^{2-2H} \qquad (3.81)$$

This equation can serve as the definition of the HK process. By setting H = 1/2 we recover
equation (3.50), which means that the HK process is a generalization of the white noise.
Its other second-order characteristics are given in Table 3.6. Their LLDs are constant for
all time lags and scales and all frequencies:

$$\varphi(k) = \varphi_C(k) = H, \qquad \gamma^{\#}(k) = c^{\#}(h) = 2H - 2, \qquad \psi^{\#}(k) = -s^{\#}(w) = 2H - 1 \qquad (3.82)$$

including their asymptotic values at 0 and ∞. Accordingly, M = H − 1.


Table 3.6 Second-order characteristics of the Hurst-Kolmogorov process at continuous and
discrete time.

Property                                       Formula                                                            Eqn. no.
Variance
  Continuous-time process (instantaneous)      γ₀ = γ(0) = c(0) = ∞                                               (3.83)
  Averaged process at scale k (climacogram)    γ(k) = λ (α/k)^(2−2H)                                              (3.81)
Autocovariance function
  Continuous time, lag h                       c(h) = λH(2H − 1)(α/h)^(2−2H),               H > 1/2               (3.84)
                                               c(h) = λ δ(h/α),                             H = 1/2
                                               c(h) = λH(2H − 1)(α/h)^(2−2H) + ∞ δ(h/α),    H < 1/2
  Discrete time, lag η = h/D                   c_η = λ (α/D)^(2−2H) (|η − 1|^(2H)/2 + |η + 1|^(2H)/2 − |η|^(2H))   (3.85)
Power spectrum¹
  Continuous time, frequency w                 s(w) = 2αλ Γ(2H + 1) sin(πH) / (2παw)^(2H−1)                       (3.86)

¹ The power spectrum of the discrete-time (averaged) process exists (it is finite for ω > 0) but it does not
have a closed expression. However, for small frequencies (ω = wD < 0.1), the continuous-time expression
is a very good approximation for the discrete-time process, i.e. s_d(ω) ≈ s(ω/D).

Figure 3.6 Second-order characteristics of a Hurst-Kolmogorov process with parameters λ =
1, α = 20, H = 0.8 and discretization time step D = 1. The climacogram and climacospectrum are
precisely the same for the continuous- and discrete-time representations. The autocovariance and
the power spectrum have some differences between the two representations, which are visible in
both cases.

The process is also known as fractional Gaussian noise (FGN) due to Mandelbrot and
van Ness (1968), although these authors used a more complicated approach to define it.
Here we do not use the term FGN as the adjective fractional is not quite informative (there
cannot be a non-fractional process; note that the white noise, in which H = 0.5, is fractional
too), the adjective Gaussian is too restrictive (we will implement non-Gaussian HK) and 
the noun noise is too negative and perhaps misleading when we try to describe Nature’s 
processes. As already mentioned, the mathematical process had been earlier proposed by 
Kolmogorov (1940), while Hurst (1951) pioneered the detection in geophysical time 
series of the behaviour described by this process; hence the name HK we use for this 
process. 

Because this process has infinite instantaneous variance, the sampled process in 
discrete time is not meaningful (many characteristics take infinite values). However, the 
averaged process is well behaving with all of its characteristics (including its variance) 
finite, which makes it quite useful in applications. 

The HK process is almost as simple and parsimonious as the Markov process;
again, it contains only one parameter, H, in addition to those describing its marginal
distribution. Notice that the process variance is controlled by the product λα^(2−2H), so that
λ and α are not in fact separate parameters. Despite that, we prefer the formulation shown
in Table 3.6 with three nominal parameters for dimensional consistency: α and λ are scale
parameters with dimensions [t] and [x²], respectively, while H, the Hurst coefficient, is
dimensionless in the interval (0, 1). 

For H = 1/2 the process reduces to pure white noise. For 1/2 < H < 1 the process is 
persistent and for 0 < H < 1/2 antipersistent. Most of the expressions shown in Table 
3.6 hold in all three cases. However, the autocovariance c(h) has different expressions in 
the three cases, as shown in Table 3.6. Specifically, for H < 1/2, the autocovariance c(h) is 
negative for any lag h > 0, tending to −∞ as h → 0. However, at h = 0, c(0) = +∞,
because this is the variance of the process which cannot be negative; thus, there is a 
discontinuity at h = 0. Consequently, the averaged process has positive variance and all 
covariances negative. Such a process is not physically realistic because real-world events 
at near times are always positively correlated, which means that for small h, c(h) should 
be positive. Also, the infinite variance cannot appear in nature. Thus, the HK process can 
describe natural phenomena only for 1/2 < H < 1 and for time scales not too small. 
Furthermore, values H > 1 that sometimes are being reported in the literature are 
mathematically invalid (Koutsoyiannis, 2014b, 2017; Koutsoyiannis et al. 2018; see also 
Figure 3.4) and are results of inconsistent algorithms. In terms of entropy production, the 
process maximizes it for large times (t → ∞) but minimizes it for small times (t → 0).
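The sign pattern of the discrete-time autocovariance in the three ranges of H can be checked with a short Python sketch of equations (3.81) and (3.85). The third function recomputes the variance of the mean of κ consecutive terms directly from the autocovariances, as a consistency check against the climacogram. Function names are illustrative.

```python
def hk_climacogram(k, lam, alpha, H):
    """gamma(k) = lam * (alpha/k)^(2 - 2H), equation (3.81)."""
    return lam * (alpha / k) ** (2.0 - 2.0 * H)

def hk_autocov_discrete(eta, lam, alpha, H, D=1.0):
    """c_eta of the averaged process at time step D, equation (3.85)."""
    eta = abs(eta)
    shape = (0.5 * abs(eta + 1) ** (2.0 * H)
             + 0.5 * abs(eta - 1) ** (2.0 * H)
             - eta ** (2.0 * H))
    return lam * (alpha / D) ** (2.0 - 2.0 * H) * shape

def climacogram_from_autocov(kappa, lam, alpha, H, D=1.0):
    """Variance of the mean of kappa consecutive terms, from the c_eta values."""
    total = sum((kappa - abs(i)) * hk_autocov_discrete(i, lam, alpha, H, D)
                for i in range(-(kappa - 1), kappa))
    return total / kappa ** 2

if __name__ == "__main__":
    lam, alpha = 1.0, 20.0
    for H in (0.3, 0.5, 0.8):
        lags = [hk_autocov_discrete(eta, lam, alpha, H) for eta in (0, 1, 2, 10)]
        print(H, lags)
```

For H = 0.5 all autocovariances at nonzero lags vanish, for H > 0.5 they are positive and for H < 0.5 negative, in agreement with the discussion above.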


Digression 3.F: Developments in stochastic modelling in hydrology before 
and after Hurst 


Hurst’s (1951) discovery of the natural behaviour named after him was triggered by a real-world 
problem of engineering hydrology, the design of reservoirs. This gave hydrology a central role in 
understanding this behaviour and subsequently in the dissemination process to other disciplines. 
It is a further mark of distinction that the large-scale “export” from hydrology to other fields has
characterized Hurst’s research, as hydrology is most often an importer of stochastic methods from
other fields (O’Connell et al., 2016). 

The understanding that hydrological processes cannot be effectively modelled by 
deterministic techniques preceded Hurst’s research. Techniques that could be classified as 
applications of the Monte Carlo method had appeared in the hydrological literature much earlier 
than the “official start” of the Monte Carlo method in 1949 and of Hurst’s (1951) paper. Hazen 
(1914) made a pioneering study in which he introduced the reservoir storage-yield-reliability
relationship, a concept that would remain unexploited in the western hydrological literature yet 
constituting the scientific basis of modern reservoir design (Klemes, 1987). In that study he 
proposed an empirical simulation technique and formed a synthetic time series by combining 
historical flow records of different rivers ‘spliced’ sequentially together. Sudler (1927) extended 
the work of Hazen by resampling from a sequence of historical river flows using cards, which he 
shuffled to form new sequences of data. Obviously, this method heavily distorts the time 
dependence of river flows whose importance was not known at that time. 

For it was Hurst (1951) who understood that importance along with the omnipresence in 
natural processes of a clustering behaviour of similar events in time, a behaviour that is now 
understood as (long-term) persistence, long-range dependence (LRD) or Hurst-Kolmogorov 
dynamics. In his attempt to compare natural and random events, Hurst performed physical 
experiments to generate random numbers. Specifically, he tossed 10 coins (sixpences) 
simultaneously and repeated this 1025 times (note that 10 binary digits are equivalent to about 
3 decimal digits). As he notes, his rate was 100 random numbers per 35 min (while that would be 
of the order of a microsecond in modern computer environments, even slow ones). He also used 
another method, shuffling and cutting a pack of 52 cards, in which he improved the rate to 100 
random numbers per 20 min. 

The behaviour discovered by Hurst is now known to many disciplines, most prominently in 
information sciences, biological and medical sciences, economics and finance, and geophysical 
sciences—excepting climate science where it is rather unknown. Even within the hydrological 
community it took decades before assimilating Hurst’s discovery of persistence (O’Connell et al. 
2016). Thus, the initial studies implementing primitive variants of stochastic simulation did not 
reproduce LRD. Barnes (1954), in designing a reservoir in Australia, used a table of random 
numbers from normal distribution to generate a 1000-year sequence of synthetic annual data. 
Thomas and Fiering (1962) generated flows correlated in time, but using only the lag-one 
autocorrelation, obviously neglecting LRD. Beard (1965) and Matalas (1967) generated 
concurrent flows at several sites. Chow (1969), and Chow and Kareliotis (1970) systematized the 
use of time series models (in particular—and using their terminology—moving average models, 
sum of harmonics models and autoregression models) and highlighted their value in the economic 
planning of water supply and irrigation projects. It is evident from the above pioneering studies, 
as well as from myriads of subsequent studies, that hydrologists have followed (and today still do)
the Time Series School rather than the more rigorous Stochastic School. 


3.13 The Filtered Hurst-Kolmogorov process 


The HK process should not be regarded as a model of general validity, but as one that is
valid for large scales, and we will indeed use it as more physically plausible than
processes with exponential decrease of autocovariance (e.g. the Markov process). To this
aim, we can appropriately filter HK to make it a physically consistent process for all scales.
This is the same as what we did to the white noise to make it physically consistent by
removing infinities.

Similar to the white noise process, if we filter an input v(t) that is now an HK process, 
either by a moving average filter or by a linear differential equation system, then it is easy 
to see that the filtered output is a physically realistic process with finite variance γ(0),
practically unaffected climacogram γ(k) at large scales, with γ#(∞) = 2H − 2 (as in the
original HK process) but highly modified climacogram at small scales, thus having a valid
structure with M = φ_C(0) − 1 = (ψ#(0) − 1)/2 = H.

However, to enrich the process we can make the parameter M independent of H, thus 
making it more flexible to model real world data. For the model application it is not 
necessary to specify the linear filter needed to convert the HK process into a filtered 
Hurst-Kolmogorov (FHK) process (in some cases this would be too involved). It suffices 
to specify a convenient expression of the climacogram. Below we provide three such 
expressions (from Koutsoyiannis, 2017). All expressions contain the dimensionless 
parameters M and H with the meaning and values discussed in section 3.8. 


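1. The generalized Cauchy-type (FHK-C) climacogram:

$$\gamma(k) = \lambda\left(1 + (k/\alpha)^{2M}\right)^{\frac{H-1}{M}} \qquad (3.87)$$

2. The generalized Dagum-type (FHK-D) climacogram:

$$\gamma(k) = \lambda\left(1 - \left(1 + (k/\alpha)^{2(H-1)}\right)^{\frac{M}{H-1}}\right) \qquad (3.88)$$

3. The composite Cauchy-Dagum-type (FHK-CD) climacogram, derived by summing
an FHK-C with M = 1 and an FHK-D with H = 0:

$$\gamma(k) = \lambda_1\left(1 + (k/\alpha_1)^{2}\right)^{H-1} + \lambda_2\left(1 - \left(1 + (k/\alpha_2)^{-2}\right)^{-M}\right) \qquad (3.89)$$

4. A second form of FHK-CD (FHK-CD2), derived by summing an FHK-C with M = 1/2
and an FHK-D with H = 1/2:

$$\gamma(k) = \lambda_1\left(1 + k/\alpha_1\right)^{2H-2} + \lambda_2\left(1 - \left(1 + \alpha_2/k\right)^{-2M}\right) \qquad (3.90)$$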


FHK-CD in either of the variants (3.89) and (3.90), is most convenient, as the first 
additive term determines merely the persistence of the process and the second one the 
smoothness of the process. In addition, it is more flexible and richer than its constituents, 
as it contains two couples of scale parameters; however, if parsimony is sought, then it 
can take the same number of parameters as each of the constituents by setting α_1 = α_2 =
α and λ_1 = λ_2 = λ (note that, for dimensional consistency, λ and α are minimal parameter
requirements).
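The climacogram expressions listed above translate directly into code. The following Python sketch implements the Cauchy-type, Dagum-type and composite (3.89) forms and estimates the log-log slope numerically, so that the large-scale slope 2H − 2 can be checked. It assumes 0 < M and 0 ≤ H < 1; the function names and the finite-difference helper are illustrative.

```python
import math

def fhk_c(k, lam, alpha, M, H):
    """Cauchy-type climacogram, equation (3.87)."""
    return lam * (1.0 + (k / alpha) ** (2.0 * M)) ** ((H - 1.0) / M)

def fhk_d(k, lam, alpha, M, H):
    """Dagum-type climacogram, equation (3.88)."""
    inner = 1.0 + (k / alpha) ** (2.0 * (H - 1.0))
    return lam * (1.0 - inner ** (M / (H - 1.0)))

def fhk_cd(k, lam1, alpha1, lam2, alpha2, M, H):
    """Composite form (3.89): FHK-C with M = 1 plus FHK-D with H = 0."""
    return fhk_c(k, lam1, alpha1, 1.0, H) + fhk_d(k, lam2, alpha2, M, 0.0)

def log_log_slope(f, k, eps=1e-3):
    """Central finite-difference estimate of d ln f / d ln k at scale k."""
    up, down = f(k * math.exp(eps)), f(k * math.exp(-eps))
    return (math.log(up) - math.log(down)) / (2.0 * eps)

if __name__ == "__main__":
    lam, alpha, M, H = 1.0, 1.0, 0.5, 0.8
    for k in (1e-3, 1.0, 1e3, 1e8):
        print(k, fhk_c(k, lam, alpha, M, H), fhk_d(k, lam, alpha, M, H))
    print(log_log_slope(lambda k: fhk_c(k, lam, alpha, M, H), 1e8), 2.0 * H - 2.0)
```

With M = 1 − H the first two functions return the same values for every k, since both expressions then reduce to λ/(1 + (k/α)^(2−2H)).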

In the special case M = 1 —H both FHK-C and FHK-D result in precisely the same 
expression: 


$$\gamma(k) = \frac{\lambda}{1 + (k/\alpha)^{2-2H}} \qquad (3.91)$$


For large k/a the FHK process tends to the HK one. This is illustrated in Figure 3.7, where 
in addition the linear Markov model (for the same value of the lag-one autocovariance) is 
plotted for comparison. We notice that, as time tends to zero, the Markov and the FHK 
models have the same entropy production while the HK model is associated with minimal 
entropy production. For intermediate times the Markov model gives higher entropy 
production than the other two models, but this is done at the “expense” of giving too low 
entropy production at large time scales, at which both the HK and the FHK give precisely 
the same high entropy production.

Figure 3.7 (upper) Climacograms and (lower) EPLT (φ(t)) and CEPLT (φ_C(t)) of the three
indicated example processes for neutral smoothness (M = 0.5). At time scale D = 1 all three
processes have the same variance γ(1) = 1 and the same autocovariance for lag 1, c_1 = 0.5. Their
parameters are: for the linear Markov process α = 0.8686, λ = 1.4176; for the HK process α =
0.0013539, λ = 15.5032, H = 0.7925; for the FHK process α = 0.0013539, λ = 15.5093, M = 0.5, H =
0.7925 (note that for the HK process the parameter set α = λ = 1 is equivalent to the above, but the
former set was preferred in order to be comparable to the FHK).


Digression 3.G: Entropy production and time series patterns 


The different patterns in time series generated by different M and H (specifically for the Cauchy- 
type climacogram) are illustrated in the plots of Figure 3.8, also in comparison with two other 
models, the white noise (panel (a)) and the linear Markov model (panel (b)). These two serve as 
good benchmark models for comparisons: the former is free of patterns as it reflects pure 
randomness, and the latter is fully neutral (neither rough nor smooth as φ_C(0) = 3/2, and neither
antipersistent nor persistent as φ_C(∞) = 1/2).

Figure 3.8 The first fifty terms at time scales k = 1 and 20 of time series produced by various models, along
with “stamps” of the models (thick lines plotted with respect to the right vertical axes) represented by the
CEPLT, φ_C(k). The different models are (a) white noise; (b) Markov; (c) FHK, with CEPLT close to the
absolute maximum (H = M = 0.97); (d) FHK, with CEPLT close to the absolute minimum (H = M = 0.05); (e)
FHK, with CEPLT close to the absolute maximum for large scales (H = 0.99) and close to the absolute
minimum for small scales (M = 0.01); (f) FHK with CEPLT close to the absolute minimum for large scales (H
= 0.01) and to the absolute maximum (M = 0.99) for small scales.


The time series plotted in Figure 3.8 were generated by the symmetric moving average (SMA) 
scheme which will be described in Chapter 7, with 1024 coefficients (weights) a. In all cases the 
discretization time scale is D = 1, the characteristic time scale α = 10, and the characteristic
variance scale λ is chosen so that for time scale D, γ(D) = 1. The mean is 0 in all cases and the
marginal distribution is normal. The FHK is implemented using the Cauchy-type climacogram. 
Each of the panels shows the first fifty terms of time series produced by each of the model 
implementations at time scales k = 1 and 20. In addition, each panel contains a “stamp” of the 
specific model represented by the plot of CEPLT, @c(k). In this way the time series patterns can 
be connected to the entropy production of the generating mechanism. 

In panel (c) the CEPLT is close to the absolute maximum both for small and large scales (H = M
= 0.97 so as to obtain $\varphi_C(0) = 1.97 \approx 2$ and $\varphi_C(\infty) = 0.97 \approx 1$); notable is the very smooth shape at
scale 1 and the large departures from the mean (which is 0) at scale 20. On the contrary, in panel
(d) the CEPLT is close to the absolute minimum for all scales (H = M = 0.05, so as to obtain
$\varphi_C(0) = 1.05 \approx 1$ and $\varphi_C(\infty) = 0.05 \approx 0$; for better visualization it was preferred not to use values of H
and M < 0.05). Furthermore, in panel (e) the CEPLT is close to the absolute maximum for large
scales ($H = \varphi_C(\infty) = 0.99 \approx 1$) and close to the absolute minimum for small scales (M = 0.01
resulting in $\varphi_C(0) = 1.01 \approx 1$). Finally, in panel (f) the conditions are opposite to those in (e), i.e.,
the CEPLT is close to the absolute minimum for large scales ($H = \varphi_C(\infty) = 0.01 \approx 0$) and to the
absolute maximum for small scales (M = 0.99 resulting in $\varphi_C(0) = 1.99 \approx 2$).

The particular case of panel (e) is close to what is usually called “pink noise” or “1/f noise”, as
the power spectrum has almost constant slope -1 for the entire frequency domain (which is the 
same in the climacospectrum). This means that using the FHK model we can theoretically 
represent and practically produce even “pink noise” in a consistent stationary setting without 
linking it to a nonstationary process (Keshner, 1982; Wornell, 1993), which involves several 
theoretical inconsistencies. Indeed, the small change of slope from 0.99 to 1.01 is not actually 
visible, especially considering the very rough shape of the empirical periodogram, which certainly 
cannot support differentiation between 0.99 and 1. The FHK model can be used also in other ways 
to produce “pink noise”, that is, by selecting a very large (small) parameter a so as to expel from 
our field of vision the asymptotic behaviour on large (small) scales. And we can imagine that in 
several cases of empirical explorations using observations of natural processes, the observation 
resolution and length, compared to characteristic scale(s) of the process, are such as to hide the 
asymptotic behaviour of the process. We can use this as a trick to obtain virtually constant power 
spectrum slopes much steeper than -1. Specifically, we can use a large a, so that the asymptotic
behaviour at low frequencies or large scales, and hence the asymptotic slope, falls outside our field of vision (see example
in Koutsoyiannis, 2017). But this should not mislead us to interpret the steep slopes as indicators
of nonstationarity. 


3.14 Dependence and behaviour of extremes 


When we study extremes, we are typically satisfied by specifying the marginal 
distribution. As analysed in Chapter 2, this is generally sufficient for design purposes, 
where the design is based upon the concept of return period. In this respect the 
dependence structure of the process of interest may not affect the design procedure per 
se. However, the dependence in a stochastic process alters substantially the temporal 
distribution of extremes. In a process with dependence there are patterns, and specifically 
periods with clustered extremes and periods with absence or infrequent occurrence of 
extremes. We should thus adapt our perception of the behaviour of extremes to become 
consistent with this reality; without such adaptation our perception is typically guided by 
the “roulette-wheel” paradigm, in which there are no patterns. 

There is an additional, more severe, consequence of the presence of dependence. 
Hydroclimatic studies necessarily rely on data to make inference. Data records are 
typically insufficient and actually become even more so in the presence of extremes. The 
latter problem affects also the specification of the marginal distribution. This is illustrated 
by a simulation experiment in Digression 3.H. Quantification of the consequences will be 
given in Chapter 4 and a way to take into account the dependence in specifying the 
marginal distribution from data will be discussed in Chapter 6. 


Digression 3.H: Relationship of persistence and distribution tail 


To illustrate whether or not (and how) persistence (or long-range dependence, or just change)
affects the estimation of the marginal distribution of a discrete-time stationary process $\underline{x}_\tau$, we
perform a simulation experiment. We assume that the marginal distribution of $\underline{x}_\tau$ is exponential:
$f(x|\lambda) = \lambda e^{-\lambda x}$. Further, we make two alternative assumptions:
(a) that the parameter $\lambda$ is constant, $\lambda = 5$, and
(b) that $\lambda$ is slowly varying with mean $\mu_\lambda = 5$ and standard gamma distribution,
$f_\lambda(\lambda) = \lambda^{\kappa-1} e^{-\lambda}/\Gamma(\kappa)$ with $\kappa = \mu_\lambda = 5$.


To simulate a slowly varying $\lambda$ we initially generate a time series of a stochastic process $\lambda'$ with
the same distribution as $\lambda$ from the HK process with a high H = 0.95. Then we form a time series of $\lambda$
with the rule $\lambda_\tau = \lambda'_\tau$ with probability 1/100, otherwise $\lambda_\tau = \lambda_{\tau-1}$. The latter rule assures that each
value $\lambda_\tau$ lasts for 100 time units on the average. The HK process used for $\lambda'_\tau$ assures that there is
change on all scales, not just at scale 100. Koutsoyiannis (2004a) has shown that the unconditional
distribution of x in this case is Pareto rather than exponential, i.e. $f(x) = \kappa\,(1 + x)^{-\kappa-1}$.
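
The effect of a varying rate parameter on the tail can be checked numerically with a minimal sketch such as the following. It is not the experiment of the text: as a simplifying assumption, new values of the rate are drawn as independent gamma variates rather than from an HK process, so only the marginal distribution (not the multi-scale dependence) is reproduced. The function names `simulate` and `exceedance` are hypothetical helpers introduced here for illustration.

```python
import math
import random

def simulate(n, kappa=5.0, switch_prob=0.01, constant=False, seed=1):
    """Generate n exponential variates whose rate is either constant
    (alternative a) or redrawn from a standard gamma distribution with
    probability switch_prob at each time step (alternative b)."""
    rng = random.Random(seed)
    lam = kappa if constant else rng.gammavariate(kappa, 1.0)
    xs = []
    for _ in range(n):
        if not constant and rng.random() < switch_prob:
            # Assumption: new rates are IID gamma draws, not an HK series.
            lam = rng.gammavariate(kappa, 1.0)
        xs.append(rng.expovariate(lam))
    return xs

def exceedance(xs, x0):
    """Empirical probability of exceeding the level x0."""
    return sum(1 for x in xs if x > x0) / len(xs)

n, kappa, x0 = 400_000, 5.0, 1.0
p_const = exceedance(simulate(n, kappa, constant=True), x0)
p_vary = exceedance(simulate(n, kappa, constant=False), x0)

# Theoretical exceedance probabilities: exponential and Pareto, respectively.
print("constant rate:", p_const, "theory:", math.exp(-kappa * x0))
print("varying rate: ", p_vary, "theory:", (1.0 + x0) ** (-kappa))
```

Under these assumptions the exceedance probability of the level x = 1 should be several times larger for the varying rate than for the constant one, which is the tail thickening discussed in this digression.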

In either of the two alternatives, once $\lambda$ is known at time step $\tau$, we generate $x_\tau$ from the
exponential distribution independently of previous and next $x_\tau$. In alternative (a), the resulting
process will be white noise. However, in alternative (b), the change of the parameter induces
dependence, while the process $\underline{x}_\tau$ remains stationary (because the change is stochastic, resisting
a deterministic description).

Figure 3.9 (upper row) depicts two time series $x_\tau$, each with length 10 000, generated with
alternative (a) (left panel) and (b) (right panel). Moving averages for a time scale of 500, also 
plotted in the two panels, indicate the absence of patterns (pure randomness, white noise) in 
alternative (a) and the long-range dependence (not nonstationarity) in alternative (b). 

Now let us assume that this time series represents a hypothetical hydroclimatic process on
annual scale. Let us further assume that a researcher has a record of fewer than 100 observations.
Most probably all these refer to the same value of the parameter $\lambda_\tau$. Consequently, the researcher
would diagnose that:

• the process behaves like white noise; indeed, the slope of the climacogram (Figure 3.9,
lower right) for scales < 10 (one tenth of the sample size) is -1;

• the marginal distribution is exponential, because it indeed is exponential conditionally on a
single value of $\lambda$.


The two distributions for constant and varying $\lambda$ (cases (a) and (b)) are shown in the bottom-left
panel of Figure 3.9, along with the distribution of $\lambda$ in case (b), as empirically derived from the
simulations. The adoption of the former underestimates the design quantities for large return 
periods. Furthermore, the bottom-right panel shows the dramatic differences in climacograms of 
the two cases. The climacogram in case (b) starts with a slope -1 for scales < 10, but for large 
scales this becomes -0.33, suggesting H = 0.84. The varying slope is consistent with the findings 
of Markonis and Koutsoyiannis (2016) for the rainfall process. Overall, this simulation experiment 
shows two things. 


• Long series are needed to diagnose natural behaviours and in particular the multi-scale change
in natural processes.

• The mechanisms producing change may also lead to thickening of the distribution tail and thus
enhance the occurrence probability or the intensity of extremes.


These effects are particularly important when we study maxima, neglecting the small values 
(below a high threshold), a practice that tends to hide the existence of long-range dependence 
even in long records (see Iliopoulou and Koutsoyiannis, 2019).

Figure 3.9 Graphs for the hypothetical example studied in Digression 3.H (see text for explanation).


Chapter 4. Fundamental concepts of statistics and their adaptation to 
stochastic processes 


4.1 Introductory comments 


The first aim of this chapter is to serve as a synopsis (rather than a systematic and 
complete presentation) of fundamental statistical concepts. It is well known that the aim 
of statistics per se is to provide a methodology for drawing conclusions based on 
observations. The conclusions are only inferences based on induction, not deductive 
mathematical proofs (see Digression 4.A); however, if the associated probabilities 
approach 1, they become almost certainties. 

The classical statistical theory is entirely based on the assumption that observations 
are from a sample, a concept (formally defined in section 4.2) whose very definition relies 
on independence of observations. However, when we deal with hydroclimatic processes 
there cannot be independence. Instead of samples we have time series and there is 
dependence in time. Even when we are interested in the spatial behaviour of processes,
again we have to deal with dependence in space. Hence, the second aim of this chapter is 
to adapt and extend the classical statistical concepts and methodologies to make them 
applicable to a universe in which there is dependence. 

Two major tasks in statistics are estimation and hypothesis testing. Statistical 
estimation can be divided into parameter estimation and prediction and can be
performed either on a point basis (resulting in a single value, typically the expectation, the 
Aristotelian mesotes), or on an interval basis (resulting in an interval in which the quantity 
sought lies, associated with a certain probability or confidence). The results of an 
estimation procedure are called estimates. Uses of statistical estimation in hydroclimatic 
applications include the estimation of parameters of marginal probability distributions or 
of the stochastic model describing the dependence in time, and of distribution quantiles.
All these concepts are briefly discussed both at a theoretical level, to clarify the concepts 
and avoid misuses, and at a more practical level to illustrate the application of the 
concepts. Statistical hypothesis testing is also an important tool that constitutes the basis 
of decision theory. In hydroclimatic studies, it is useful not only in decision making, but 
also in exploratory tasks, such as in detecting relationships among different processes. 
Hypothesis testing can be performed by the classical framework known as statistical 
significance (related to a null hypothesis) or within a Bayesian framework. These topics are 
not covered in this text. On the other hand, we put emphasis on the concept of order 
statistics (section 4.12), which is much more important when dealing with extremes. 


Digression 4.A: Deduction and induction 


The theory of probability has provided solid scientific grounds for philosophical concepts such as 
indeterminism and causality. In typical scientific and technological applications, probability has 
provided the tools to quantify uncertainty, rationalize decisions under uncertainty, and make 
predictions of future events under uncertainty, in lieu of unsuccessful deterministic predictions 
(see Koutsoyiannis, 2010). 

Quite importantly, probability has also provided the basis for extending the typical 
mathematical logic, offering the mathematical foundation of induction. Thus, probability made it
possible to incorporate into mathematics the entire Aristotelian logic, which in addition to
deductive reasoning or deduction (the Aristotelian apodeixis) also includes induction (the 
Aristotelian epagoge). 

In classical mathematical logic, determinism can be paralleled to the premise that all truth can 
be revealed by deductive reasoning. This type of reasoning consists of repeated application of 
strong syllogisms such as: 


(Premise)      If A is true, then B is true;      If A is true, then B is true;
(Evidence)     A is true;                         B is false;
(Conclusion)   B is true.                         A is false.


Deduction uses a set of axioms to prove propositions known as theorems, which, given the 
premises (axioms), are irrefutable, absolutely true statements. It is also irrefutable that deduction 
is the preferred route to truth; the question is, however, whether or not it has any limits. 

David Hilbert’s famous aphorism (later inscribed on his tombstone at Göttingen) “Wir müssen
wissen, wir werden wissen” (“We must know, we will know”), expressed his belief that there were
no limits in deduction. According to this belief, more formally known as completeness, any
mathematical statement could be proved or disproved by deduction from axioms. However,
developments in mathematical logic, and particularly Gödel’s incompleteness theorem, challenged
the omnipotence of deduction suggesting the usefulness and necessity of induction.

Induction uses weaker inference rules of the type: 


(Premise)      If A is true, then B is true;      If A is true, then B is true;
(Evidence)     B is true;                         A is false;
(Conclusion)   A becomes more plausible.          B becomes less plausible.


Induction does not offer a proof that a proposition is true or false and may lead to errors. However, 
it is very useful in decision making, when deduction is not possible, which is the case quite 
frequently in the real world and in everyday life (see Jaynes, 2003). 

The important achievement of probability is that it quantifies (expresses in the form of a 
number between 0 and 1) the degree of plausibility of a certain proposition or statement. The 
formal probability framework uses both deduction, for proving theorems, and induction, for 
inference with incomplete information or data. For the latter we use the branch of stochastics 
called statistics. 
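
The two weak syllogisms above can be given a quantitative form by the rules of probability. The following is only a sketch of the standard argument (in the spirit of Jaynes, 2003), under the assumptions, introduced here, that the premise "if A is true, then B is true" is read as P(B|A) = 1, that P(B) > 0 and that P(A) < 1; the symbol $\bar{A}$ denotes the negation of A.

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(A)}{P(B)} \ge P(A),
\qquad
P(B \mid \bar{A}) = \frac{P(B) - P(A)}{1 - P(A)} \le P(B)
```

The first relationship says that observing B cannot decrease the plausibility of A, and the second that observing the falsity of A cannot increase the plausibility of B.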


4.2 Samples and time series 


Loosely speaking, statistics draws conclusions for a population, based on a sample. 
Although the content of population is not strictly defined in the statistical literature, the 
term describes any collection of objects whose measurable attributes are of interest. The 
population can refer to the real world and be finite (e.g., the inhabitants of Europe, the 
mean annual flows of year 2000 for all river basins on Earth with size greater than 100
km²). It can also be an abstraction of a real-world entity referring to the possible (typically
infinite) outcomes of a real or a hypothetical experiment (e.g., the population of all 
possible annual flows in a river cross-section). Here we deal with populations of the latter 
type and, because of this, it is not necessary to use the term population at all—and hence 
to define it. Rather, the notions of a stochastic variable and a stochastic process suffice. 
Therefore, we will not use terms like population mean to distinguish from the sample
mean. Instead, we will refer to the former concept with terms like true mean, ensemble
mean or simply mean, where the term ensemble suggests all possible outcomes of
repeated experiments.

On the contrary, the term sample has a clear definition. Specifically, a sample of size (or
length) n of a stochastic variable $\underline{x}$, defined on a basic set $\Omega$, with probability density
function f(x), is a sequence of n independent identically distributed (IID) stochastic
variables $(\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n)$ defined on the sample space $\Omega_n = \Omega \times \cdots \times \Omega$, each having
density f(x) (Papoulis, 1990, p. 238). After observation of the variables $\underline{x}_i$, to each
variable there corresponds one numerical value. Consequently, we will have a numerical
sequence $x_1, x_2, \ldots, x_n$, called the observed sample. It is clear from this definition that a
sample is not a subset of the population, as some may think, but a concept related to the
Cartesian product of the population.

The concept of a sample is, thus, related to sequences of two types: an abstract 
sequence of stochastic variables and the corresponding sequence of their numerical 
values. It has been a common practice to use the term sample indistinguishably for both 
sequences, omitting the term observed from the latter. However, the two concepts are
fundamentally different and we should take care to distinguish, each time, which of
the two the term sample refers to.

The above definition and in particular the IID specification suggests that the 
construction of a sample of size n, or the sampling, is done by performing n repetitions of 
an experiment. The repetitions should be independent of each other and be performed
under virtually the same conditions. However, in dealing with natural phenomena (out of 
the laboratory) it is not possible to repeat the same experiment, and thus literally there 
cannot be sampling. Instead, what is actually done is measurement of the natural process 
at different times. As a consequence, it is not possible to ensure that independence and 
same conditions hold. Actually, in most cases we can be sure of the opposite. Then the use 
of classical statistics may become dangerous as the estimates and inferences may be 
totally wrong. 

Still, however, we can do our job in a reliable manner if, instead of using classical 
statistics, we rely on stochastics. Actually, there is the following correspondence between 
classical statistical concepts and stochastic concepts:


Classical statistics (independence) → Statistics within stochastics (dependence)
Sample → Stochastic process (discrete or discretized)
Observed sample → Time series


Typically, the use of stochastics assuming dependence makes the mathematical 
derivations and calculations more complicated, while the resulting uncertainty is greater 
when there is dependence. 


4.3 Expectation and its estimation


As we have stressed in Chapter 2, functions of stochastic variables, e.g. $\underline{z} := g(\underline{x})$, are
stochastic variables and expected values of stochastic variables are regular variables; for
example $E[\underline{x}]$ and $E[g(\underline{x})]$ are constants (neither functions of $x$ nor of $\underline{x}$), i.e.:

$$E[\underline{x}] = \int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x =: \mu, \qquad E[g(\underline{x})] = \int_{-\infty}^{\infty} g(x) f(x)\,\mathrm{d}x \tag{4.1}$$

where f(x) is the probability density function. It should be stressed that these
expectations are not time averages. Sometimes to make it clearer we call them true or 
ensemble means, variances, covariances etc. For an ergodic process, true expectations are 
related to time averages through the following asymptotic relationship (section 3.4): 


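$$\underline{\hat{G}}^{(\infty)} := \lim_{T\to\infty} \frac{1}{T}\int_0^T g(\underline{x}(t))\,\mathrm{d}t = E[g(\underline{x}(t))] =: G \tag{4.2}$$

We notice that the left-hand side, $\underline{\hat{G}}^{(\infty)}$, is a random variable while the right-hand side, G,
is a regular variable; their equality implies that the variance of $\underline{\hat{G}}^{(\infty)}$ is zero.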

When dealing with data from a process x(t) with a joint distribution function that is 
unknown, neither the left- nor the right-hand side of (4.2) can be known a priori. 
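Assuming that we have a time series, at a time step D, with observations
$x_\tau = \frac{1}{D}\int_{(\tau-1)D}^{\tau D} x(u)\,\mathrm{d}u$, $\tau = 1, \ldots, n$ (see equation (3.1)), we can approximate the
left-hand side by:

$$\hat{G} = \frac{1}{n}\sum_{\tau=1}^{n} g(x_\tau) \tag{4.3}$$

The regular variable $\hat{G}$ is called an estimate of the true expectation G. Replacing in
equation (4.3) the values $x_\tau$ with the stochastic variables $\underline{x}_\tau$ we define:

$$\underline{\hat{G}} := \frac{1}{n}\sum_{\tau=1}^{n} g(\underline{x}_\tau) \tag{4.4}$$

The stochastic variable $\underline{\hat{G}}$ is called an estimator of the true expectation G. In classical
statistics $\underline{\hat{G}}$ is also called a statistic, where the latter term denotes a (scalar) function of
the sample vector $\underline{\boldsymbol{x}} = [\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n]$.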

While the above procedure to form an estimator $\underline{\hat{G}}$ of the true expectation G is useful in
many cases, we should have in mind that many different estimators can be formulated for
a certain parameter G. An estimator is typically biased (with few exceptions, the most
notable being the estimator of the mean; see below), meaning that:

$$E[\underline{\hat{G}}] \ne G \tag{4.5}$$

A formal definition of bias is:

$$b := E[\underline{\hat{G}}] - G \tag{4.6}$$

An estimator is also characterized by its variance and its mean square error, i.e.

$$\gamma_{\hat{G}} := \mathrm{var}[\underline{\hat{G}}], \qquad e_{\hat{G}} := E\left[(\underline{\hat{G}} - G)^2\right] = \gamma_{\hat{G}} + b^2 \tag{4.7}$$

An estimator is called:

• unbiased if b = 0;

• consistent if, with probability 1, $\underline{\hat{G}} - G \to 0$ as $n \to \infty$;

• best if $e_{\hat{G}}$ is minimum;

• most efficient if it is unbiased and best.
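
The notions of bias, variance and mean square error in (4.6) and (4.7) can be illustrated by a small Monte Carlo experiment. The sketch below is only indicative and rests on assumptions introduced here: the samples are IID normal with unit variance, and the estimator examined is the variance estimator that divides by n. The function name `typical_variance` is a hypothetical helper.

```python
import random

def typical_variance(xs):
    """Variance estimate that divides the sum of squared deviations by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(42)
n, reps, true_var = 10, 20_000, 1.0

# One estimate per simulated IID sample of size n from N(0, 1).
estimates = [typical_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
             for _ in range(reps)]

mean_est = sum(estimates) / reps
bias = mean_est - true_var                                  # approximates b in (4.6)
var_est = sum((g - mean_est) ** 2 for g in estimates) / reps
mse = sum((g - true_var) ** 2 for g in estimates) / reps    # approximates e in (4.7)

print(f"bias = {bias:.4f} (theory for IID samples: {-true_var / n:.4f})")
print(f"variance + bias^2 = {var_est + bias ** 2:.4f}, mean square error = {mse:.4f}")
```

The simulated mean square error coincides with the sum of the variance and the squared bias, as in (4.7), while the bias is close to its theoretical value for IID samples, which is minus the true variance divided by n.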


The central point of the above discussion is this. When dealing with
quantification of uncertainty, for each parameter there are four different concepts, with 
slightly different names but very different meaning and content. These are often 
confounded in the literature and the same symbol and name are used for all, which causes 
confusion and may result in wrong conclusions. Table 4.1 clarifies the four different 
concepts using the variance as an example. 


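Table 4.1 Different variants of the variance of a stationary process in discrete time, $\underline{x}_\tau$, as an
example for clarifying the four different concepts.

Name: Variance (true)
Symbol and definition: $\gamma := \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,\mathrm{d}x$
Type of variable: Regular variable (not depending on $\tau$)
Type of determination: Theoretical calculation from model (by integration)

Name: Variance estimate
Symbol and definition: $\hat{\gamma} := \frac{1}{n}\sum_{\tau=1}^{n} (x_\tau - \hat{\mu})^2$, where $\hat{\mu} := \frac{1}{n}\sum_{\tau=1}^{n} x_\tau$
Type of variable: Regular variable
Type of determination: Estimation from data, but a model is also necessary (e.g. to calculate the estimation bias and uncertainty)

Name: Variance estimator
Symbol and definition: $\underline{\hat{\gamma}} := \frac{1}{n}\sum_{\tau=1}^{n} (\underline{x}_\tau - \underline{\hat{\mu}})^2$, where $\underline{\hat{\mu}} := \frac{1}{n}\sum_{\tau=1}^{n} \underline{x}_\tau$
Type of variable: Stochastic variable
Type of determination: Theoretical calculation from model

Name: Variance estimator limit
Symbol and definition: $\underline{\hat{\gamma}}^{(\infty)} := \lim_{n\to\infty} \frac{1}{n}\sum_{\tau=1}^{n} (\underline{x}_\tau - \underline{\hat{\mu}}^{(\infty)})^2$, where $\underline{\hat{\mu}}^{(\infty)} := \lim_{n\to\infty} \frac{1}{n}\sum_{\tau=1}^{n} \underline{x}_\tau$
Type of variable: Stochastic variable, which for an ergodic process has zero variance and becomes a regular variable, equal to $\gamma$
Type of determination: Theoretical calculation from model
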
From Table 4.1 we may notice that the data can be used only with one of the variance 
variants, namely the variance estimate, while a theoretical model is necessary to 
determine any of them. Even for the variance estimate, a model is necessary to estimate 
the estimation bias and uncertainty. And before specifying that model, it is fundamentally 
necessary to ensure that the assumptions of stationarity and ergodicity are valid for the 
process and the data we are dealing with. If they are valid, then the four concepts become 
three because the variance estimator limit becomes identical to the true variance. But if 
stationarity and ergodicity do not hold, then one may again use the data, do the
calculations and find a result. However, this result is meaningless and cannot be called the
variance estimate. 


4.4 Moment estimators 


The estimator of the noncentral moment (moment about the origin) of order p, $\mu'_p$, of a
stochastic variable $\underline{x}$, formed according to the method described in section 4.3, is:

$$\underline{\hat{\mu}}'_p = \frac{1}{n}\sum_{i=1}^{n} \underline{x}_i^p \tag{4.8}$$

It can be proved (Kendall and Stewart, 1963, p. 229) that:

$$E\left[\underline{\hat{\mu}}'_p\right] = \mu'_p \tag{4.9}$$

Consequently, the noncentral moment estimators are unbiased. If $\underline{x}_i$ is a (IID) sample of
size n then the variance of the estimator is:

$$\mathrm{var}\left[\underline{\hat{\mu}}'_p\right] = \frac{1}{n}\left(\mu'_{2p} - (\mu'_p)^2\right) \tag{4.10}$$

It can be observed that if the moments are finite, then the variance tends to zero as $n \to \infty$;
therefore, the estimator is consistent. However, if $\underline{x}_i$ is a stochastic process (with time
dependence) then (4.10) does not hold, even for p as low as 1.

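The estimator of the central moment $\mu_p$ is:

$$\underline{\hat{\mu}}_p = \frac{1}{n}\sum_{i=1}^{n} (\underline{x}_i - \underline{\hat{\mu}})^p \tag{4.11}$$

where $\underline{\hat{\mu}} = \underline{\hat{\mu}}'_1$ is the estimator of the mean. This is a biased estimator for any p > 1. Even
for relatively low p (e.g. 2-4), the bias can be substantial, in the case that the process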
exhibits long-range dependence (see section 4.6 about the variance). In the case of (IID) 
samples and low p, the bias is much smaller and can be easily quantified (see e.g. 
Koutsoyiannis, 1997). For higher p the estimation of moments becomes almost 
impossible; this applies not only to the biased estimators of central moments, but also to 
the unbiased estimators of noncentral moments. The reasons are the high variance and 
the extraordinarily high skewness of the estimators, which means that their expectation 
can be different from the mode (the most probable value) by orders of magnitude. 
Because of that, classical moments have been called unknowable (see Digression 4.B) and 
their estimation from data is not recommended. In Chapter 6 we will study a new type of 
moments, the knowable moments (K-moments) which can be reliably estimated for high 
orders and are particularly useful in analyses of extremes. 

In the framework developed and followed in this text, we avoid estimation of classical 
moments of order higher than 2. For this reason, in the following sections we will only 
study the estimators of classical moments of orders 1 and 2.

Digression 4.B: Are classical moments knowable?


The estimators of the noncentral moments, $\underline{\hat{\mu}}'_p$ (or even the central ones if $\mu$ is known a priori,
which however is almost never the case), are in theory unbiased, but it is impractical to use them
in estimation if p > 2 (cf. Lombardo et al. 2014). It is well known that for large p and positive $x_i$
the following approximate relationship holds:

$$\left(\sum_{i=1}^{n} x_i^p\right)^{1/p} \approx \max_{1 \le i \le n}(x_i)$$

This is related to the well-known mathematical fact that the maximum norm is the limit of the
p-norm as $p \to \infty$. This result can be generalized for $x_i$ that are not necessarily positive but satisfy
the condition $\max_{1 \le i \le n}(x_i) > |\min_{1 \le i \le n}(x_i)|$. A numerical illustration of how fast the
left-hand side of the above equation converges to the right-hand side is provided in Table
4.2.


Table 4.2 Illustration of the fact that raising to a power and adding converges fast to the maximum value.


Linear, p = 1        Pythagorean, p = 2        Cubic, p = 3              High order, p = 8
3 + 4 = 7            3² + 4² = 5²              3³ + 4³ ≈ 4.5³            3⁸ + 4⁸ ≈ 4⁸
3 + 4 + 12 = 19      3² + 4² + 12² = 13²       3³ + 4³ + 12³ ≈ 12.2³     3⁸ + 4⁸ + 12⁸ ≈ 12⁸
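
The entries of Table 4.2 can be reproduced with a few lines of code. The sketch below simply evaluates the p-norm of the two sets of numbers used in the table and prints it next to their maximum; the function name `p_norm` is a hypothetical helper introduced here.

```python
def p_norm(values, p):
    """Return (sum of |v|^p)^(1/p), i.e. the p-norm of the values."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

for values in ([3, 4], [3, 4, 12]):
    for p in (1, 2, 3, 8):
        # As p grows, the p-norm approaches the maximum of the values.
        print(values, "p =", p, "p-norm =", round(p_norm(values, p), 4),
              "max =", max(values))
```

Already for p = 8 the p-norm of (3, 4, 12) differs from the maximum value, 12, only in the fourth decimal digit.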


Therefore, for relatively large p the estimate of $\mu'_p$ will be:

$$\hat{\mu}'_p = \frac{1}{n}\sum_{i=1}^{n} x_i^p \approx \frac{1}{n}\left(\max_{1 \le i \le n}(x_i)\right)^p$$

(Note that for large p the term (1/n) in the right-hand side can be omitted with a negligible error.)
Thus, for an unbounded variable $\underline{x}$ and for large p, we can conclude that $\underline{\hat{\mu}}'_p$, while theoretically
an unbiased estimator of $\mu'_p$, is in practice more an estimator of an extreme quantity than an
estimator of $\mu'_p$. (As we will see in section 4.12, the estimated quantity is the nth order statistic
raised to power p.) This happens because the convergence of $\underline{\hat{\mu}}'_p$ to $\mu'_p$ is very slow, while the
convergence to the maximum value is fast.

This is further illustrated in Figure 4.1 for the eighth moment of a process specified in the figure
caption. Even for n as large as 64 000 the sample moment estimate continues to be smaller, by 
several orders of magnitude, than the theoretical value. However, the proximity of the moment 
estimate to the maximum value is evident even for n as small as 10. The jagged shapes of the 
curves are a clear indication of the dominance of maxima in the moment estimation: the steps 
occur when a new higher maximum value enters the sample, while the gradual decreases before 
those are due to the increase of the sample size without a higher maximum value. The ensemble 
simulation results in the right panel show that the 99% prediction limits (see their definition in 
section 4.11) from 1000 simulations are unable to even envelop the true value. 

As a result, unless p is very small, $\mu'_p$ is not a knowable quantity: we cannot infer its value from
a sample. This is the case even if n is very large as in Figure 4.1. Also, the various $\underline{\hat{\mu}}'_p$ are not
independent of each other as they only differ in the power to which the maximum value is raised.

Figure 4.1 Illustration of the slow convergence of the sample estimate of the eighth noncentral moment to
its true value, which is depicted as a thick horizontal line and corresponds to a lognormal distribution
LN(0,1) where the process is an exponentiated Hurst-Kolmogorov process with Hurst parameter H = 0.9.
(left) The sample moments are estimated from a single simulation of that process with length 64 000,
where parts of this time series with sample size n from 10 to 64 000 are used for the estimation. Subsetting
of the time series to sample size n was done either from the beginning to the end (thicker lines) or from the
end to the beginning (finer lines). Continuous lines in the two cases represent the eighth moment estimates,
$\sum_{i=1}^{n} x_i^8/n$, and dashed lines represent maximum values, $(\max_{1 \le i \le n}(x_i))^8/n$. (right) Sampling distribution
of the eighth moment estimator $\sum_{i=1}^{n} \underline{x}_i^8/n$ estimated from 1000 simulated series of length 1000 each and
visualized by the 99% prediction limits (percentiles), the median and the average, plotted as ratios to the
true value. Theoretically, the ratio should be 1, but it is smaller by many orders of magnitude, and the
convergence to 1 is very slow. The ratio to $(\max_{1 \le i \le n}(x_i))^8/n$, also plotted, is close to 1. (Source:
Koutsoyiannis, 2019a.)
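
A rough numerical counterpart of this illustration is sketched below. It is not the simulation behind Figure 4.1: as a simplifying assumption introduced here, the values are independent LN(0, 1) variates rather than an exponentiated Hurst-Kolmogorov process, and a single series of length 1000 is used.

```python
import math
import random

rng = random.Random(7)
n, p = 1000, 8

# Assumption: IID LN(0, 1) values, unlike the dependent process of Figure 4.1.
xs = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]

estimate = sum(x ** p for x in xs) / n     # sample noncentral moment of order p
max_term = max(xs) ** p / n                # contribution of the maximum value alone
true_value = math.exp(p ** 2 / 2.0)        # E[x^p] for the LN(0, 1) distribution

print(f"estimate / true value      = {estimate / true_value:.3e}")
print(f"estimate / (max^p / n)     = {estimate / max_term:.3f}")
```

Even without dependence, the estimate is expected to fall short of the true value by orders of magnitude, while its ratio to the term produced by the maximum value alone stays close to 1.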


4.5 Sample mean estimator and effective sample size 


Following (4.8) for p = 1, the estimator of the true mean $\mu$ is:
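$$\underline{\hat{\mu}} = \frac{1}{n}\sum_{i=1}^{n} \underline{x}_i \tag{4.12}$$

Another common notation of the mean estimator is $\underline{\bar{x}}$. The estimator is unbiased
($E[\underline{\hat{\mu}}] = E[\underline{x}] = \mu$). Its numerical value $\hat{\mu} = (1/n)\sum_{i=1}^{n} x_i$, else denoted as $\bar{x}$, is called the observed
mean or the average. If $\underline{x}_i$ is a (IID) sample of size n then the variance of the estimator is:

$$\mathrm{var}\left[\underline{\hat{\mu}}\right] = \frac{\mathrm{var}[\underline{x}]}{n} = \frac{\gamma_1}{n} \tag{4.13}$$

regardless of the distribution function of $\underline{x}$. However, if $\underline{x}_i$ is a stochastic process (with
dependence) then combining (3.13) and (4.12) we conclude that:

$$\underline{\hat{\mu}} = \underline{x}_1^{(n)} = \frac{\underline{X}(nD)}{nD} \tag{4.14}$$

where the superscript in parenthesis indicates that the discretization scale is nD.
Consequently:

$$\mathrm{var}\left[\underline{\hat{\mu}}\right] = \mathrm{var}\left[\frac{\underline{X}(nD)}{nD}\right] = \gamma(nD) = \gamma_n \tag{4.15}$$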




Both equations (4.13) and (4.15) suggest that the estimator is consistent (assuming 
ergodicity). However, equations (4.13) and (4.15) may result in quite different values of 
the variance. By means of these two equations we can define the notion of the 
“equivalent” (or “effective”) sample size n’ in the classical statistics (IID) sense 
(Koutsoyiannis and Montanari, 2007). This is the sample size of a hypothetical IID sample 
of a variable $\underline{x}$ with variance $\gamma_1$ whose variance of the mean equals $\gamma_n$; symbolically:

$$\frac{\gamma_1}{n'} = \gamma_n \iff n' = \frac{\gamma_1}{\gamma_n} \tag{4.16}$$

As an example, in an HK process, in which $\gamma_n = A\,(a/(nD))^{2-2H}$ (equation (3.81)), we will
have:

$$n' = n^{2-2H} \tag{4.17}$$

In white noise (H = 0.5), clearly n’ = n. However, if H = 0.9 and n = 1000 then n’ ≈ 4 (a
big difference from 1000!). Thus, a time series of 1000 terms of that HK process is 
equivalent to a (classical, IID) sample of only 4 terms. This example shows the dramatic 
increase of uncertainty in case of dependence. 
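
The numbers of this example follow directly from (4.17) and can be tabulated for other combinations of H and n. The following sketch does this; the function name `effective_sample_size` is a hypothetical helper, and the formula applies only under the HK assumption of the example.

```python
def effective_sample_size(n, hurst):
    """Equivalent IID sample size n' = n**(2 - 2H) of an HK process, per (4.17)."""
    return n ** (2.0 - 2.0 * hurst)

for hurst in (0.5, 0.7, 0.9):
    for n in (100, 1000, 10000):
        n_eff = effective_sample_size(n, hurst)
        print(f"H = {hurst}, n = {n}: n' = {n_eff:.1f}")
```

For H = 0.9 a tenfold increase of the record length increases the equivalent sample size only by a factor of about 1.6.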


4.6 Climacogram estimator and its bias 


The typical variance estimator:

$$\underline{\hat{\gamma}}_1 = \underline{\hat{\gamma}}(D) = \frac{1}{n}\sum_{\tau=1}^{n} \left(\underline{x}_\tau - \underline{\hat{\mu}}\right)^2 \tag{4.18}$$

is well known to be biased. It is also well known from elementary classical statistics books
that the replacement of n with n − 1 in the denominator of the right-hand side makes the
estimator unbiased. Thus, the classical variance estimator is:

$$\underline{\hat{\gamma}}^*_1 = \frac{1}{n-1}\sum_{\tau=1}^{n} \left(\underline{x}_\tau - \underline{\hat{\mu}}\right)^2 = \frac{n}{n-1}\,\underline{\hat{\gamma}}_1 \tag{4.19}$$

This is also known as sample variance or unbiased variance estimator. However, the latter
term is incorrect: in stochastic processes describing natural phenomena, this slight
change does not make the estimator unbiased. Here we use the term typical when we
divide the sum by n (equation (4.18)) and classical when we divide by n − 1 (equation
(4.19)). We will use the same terminology for covariances below and we will explain the
reasons that we prefer the typical over the classical.

In stochastic processes the bias can be determined analytically in terms of the 
climacogram as follows (see also Koutsoyiannis 2003, 2011a, 2016): 
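$$\begin{aligned}
E[\hat\gamma_1] &= \frac{1}{n} E\left[\sum_{\tau=1}^{n}\left((x_\tau-\mu)-(x_1^{(n)}-\mu)\right)^2\right]\\
&= \frac{1}{n} E\left[\sum_{\tau=1}^{n}(x_\tau-\mu)^2\right] - \frac{2}{n} E\left[(x_1^{(n)}-\mu)\sum_{\tau=1}^{n}(x_\tau-\mu)\right]
+ E\left[(x_1^{(n)}-\mu)^2\right]
\end{aligned} \tag{4.20}$$

Since $\sum_{\tau=1}^{n}(x_\tau - \mu) = n(x_1^{(n)} - \mu)$ we find after the algebraic manipulations:

$$E[\hat\gamma_1] = \gamma_1 - \gamma_n = \left(1 - \frac{\gamma_n}{\gamma_1}\right)\gamma_1 = \left(1 - \frac{1}{n'}\right)\gamma_1 \tag{4.21}$$

and

$$E[\hat\gamma_1^*] = \frac{n}{n-1}(\gamma_1 - \gamma_n) = \frac{1 - 1/n'}{1 - 1/n}\,\gamma_1 \tag{4.22}$$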


Likewise, for the climacogram at scale $k = \kappa D$, if the observation period is $L = nk$, the
estimators become:

$$\hat\gamma(k) = \frac{1}{n}\sum_{\tau=1}^{n}\left(x_\tau^{(\kappa)} - \hat\mu\right)^2, \qquad \hat\gamma^*(k) = \frac{n}{n-1}\,\hat\gamma(k) \tag{4.23}$$

and their expectations are:

$$E[\hat\gamma(k)] = \gamma(k) - \gamma(L) = \left(1 - \frac{\gamma(L)}{\gamma(k)}\right)\gamma(k), \qquad
E[\hat\gamma^*(k)] = \frac{1 - \gamma(L)/\gamma(k)}{1 - k/L}\,\gamma(k) \tag{4.24}$$

The above equations show that there is no gain in using the classical estimator (dividing
by $n - 1$) of variance $\hat\gamma_1^*$ (or $\hat\gamma^*(k)$). The equations are simpler if we use the typical
estimator $\hat\gamma_1$ (or $\hat\gamma(k)$) (dividing by $n$). As we will see below, the typical estimator is also
preferable in fitting distributional parameters. Whatever estimator we use, there is
estimation bias which should be taken into account in model fitting.
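
To get a feeling for the size of the bias, equations (4.21) and (4.22) can be evaluated for an HK process, for which $n' = n^{2-2H}$ by (4.17). The sketch below is illustrative; the function name and its `classical` flag are assumptions made here, not notation from the text.

```python
def expected_variance_ratio(n: int, hurst: float, classical: bool = False) -> float:
    """Ratio E[estimator] / gamma_1 for an HK process.

    Typical estimator (divide by n), equation (4.21):    1 - 1/n'
    Classical estimator (divide by n-1), equation (4.22): (1 - 1/n') / (1 - 1/n)
    with n' = n**(2 - 2H) from equation (4.17).
    """
    n_eff = n ** (2.0 - 2.0 * hurst)
    ratio = 1.0 - 1.0 / n_eff
    if classical:
        ratio *= n / (n - 1.0)
    return ratio


if __name__ == "__main__":
    for h in (0.5, 0.8, 0.9):
        typical = expected_variance_ratio(1000, h)
        classical = expected_variance_ratio(1000, h, classical=True)
        print(f"H = {h}: typical {typical:.3f}, classical {classical:.3f}")
```

Only for H = 0.5 does the classical estimator have a ratio of one; for H = 0.9 and n = 1000 both estimators recover roughly three quarters of the true variance on average.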


4.7 Covariance and autocovariance estimators 


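The typical and classical estimators of covariance:

$$\hat c_{xy} = \frac{1}{n}\sum_{\tau=1}^{n}(x_\tau - \hat\mu_x)(y_\tau - \hat\mu_y), \qquad
\hat c_{xy}^* = \frac{1}{n-1}\sum_{\tau=1}^{n}(x_\tau - \hat\mu_x)(y_\tau - \hat\mu_y) \tag{4.25}$$

are both biased if $x_\tau$ and $y_\tau$ are stochastic processes not identical to white noise. For
example, if they are HK processes with common Hurst parameter $H$, then the expectation
of $\hat c_{xy}$ is (Koutsoyiannis, 2003):

$$E[\hat c_{xy}] = \left(1 - \frac{1}{n^{2-2H}}\right)c_{xy} = \left(1 - \frac{1}{n'}\right)c_{xy} \tag{4.26}$$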
In the case of autocovariance estimation, it is common knowledge that there is
downward bias (Wallis and O’Connell, 1972; Salas, 1993, p. 19.10). The typical estimator
of the lag $\eta$ autocovariance is:

$$\hat c_\eta = \frac{1}{n}\sum_{\tau=1}^{n-\eta}(x_\tau - \hat\mu)(x_{\tau+\eta} - \hat\mu) \tag{4.27}$$

and it has been a common practice to prefer it over the classical estimator (with division
by $n - 1$ or $n - \eta$), particularly when we use autocovariance to estimate the power
spectrum. The expectation of $\hat c_\eta$ is (see also Koutsoyiannis, 2003):


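$$\begin{aligned}
E[\hat c_\eta] &= \frac{1}{n} E\left[\sum_{\tau=1}^{n-\eta}\left((x_\tau-\mu)-(x_1^{(n)}-\mu)\right)\left((x_{\tau+\eta}-\mu)-(x_1^{(n)}-\mu)\right)\right]\\
&= \frac{1}{n} E\left[\sum_{\tau=1}^{n-\eta}(x_\tau-\mu)(x_{\tau+\eta}-\mu)\right]\\
&\quad - \frac{1}{n} E\left[(x_1^{(n)}-\mu)\sum_{\tau=1}^{n-\eta}\left((x_\tau-\mu)+(x_{\tau+\eta}-\mu)\right)\right]
+ \frac{n-\eta}{n} E\left[(x_1^{(n)}-\mu)^2\right]
\end{aligned} \tag{4.28}$$

Since $\sum_{\tau=1}^{n-\eta}(x_\tau - \mu) = (n-\eta)(x_1^{(n-\eta)} - \mu)$, assuming that $\eta$ is small in comparison with
$n$ so that we can interchange $n - \eta$ and $n$, and also extend the corresponding sums, we
obtain after the algebraic manipulations:

$$E[\hat c_\eta] \approx c_\eta - \gamma_n = \left(1 - \frac{\gamma_n}{c_\eta}\right)c_\eta = \left(r_\eta - \frac{1}{n'}\right)\gamma_1, \qquad r_\eta := \frac{c_\eta}{\gamma_1} \tag{4.29}$$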


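An exact equation has been derived in Dimitriadis and Koutsoyiannis (2015a; Table 2). If
we estimate the autocorrelation coefficient by:

$$\hat r_\eta = \frac{\hat c_\eta}{\hat\gamma_1} \tag{4.30}$$

then this will be biased. An approximately unbiased estimator would be:

$$\tilde r_\eta = \frac{\hat c_\eta + \gamma_n}{\hat\gamma_1 + \gamma_n}
= \left(1 - \frac{\gamma_n}{\hat\gamma_1 + \gamma_n}\right)\hat r_\eta + \frac{\gamma_n}{\hat\gamma_1 + \gamma_n} \tag{4.31}$$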


It is stressed that the use of autocovariance and (even more so) of the autocorrelation 
estimates should be avoided in the identification and fitting phases of a stochastic model. 
Identification and fitting are better served by the climacogram (see Digression 4.C). 


Digression 4.C: The climacogram and the climacogram-based metrics 
compared to standard metrics 


The most popular procedure in time series modelling is to construct the empirical
autocorrelogram of the time series using equation (4.27) and assess which stochastic process
(e.g., of AR or ARMA type) is suitable and how many autocorrelation terms should be preserved.
It is rather easy to illustrate that this technique can completely distort the underlying process.
Figure 4.2(a) depicts the autocorrelogram of a time series with length 100, which does not seem
to have any relationship with the theoretical autocorrelation function of the model from which it 
was constructed. Namely, the model is the FHK with parameters as in the caption of Figure 3.7. 
Clearly, the empirical autocorrelation does not give any hint that the time series stems from a 
process with persistence. With that autocorrelogram one would conclude that an AR(1) model 
with a lag-1 autocorrelation of about 0.4 would be appropriate. 

The reasons for the failure of the autocorrelogram to capture the real behaviour of the process 
are two. First is the bias, as analysed in section 4.7. Second, from equation (3.29) it is seen that 
the autocorrelation is by nature the second derivative of the climacogram standardized by 
variance. Estimation of the second derivative from data is too uncertain and makes a very rough 
graph. 





[Figure 4.2 image. Left panel: autocorrelation against lag (0 to 100), with curves Theoretical, Empirical and White noise. Right panel: climacogram against time scale (logarithmic axes), with curves Theoretical, Theoretical adjusted for bias, Empirical and White noise.]


Figure 4.2 (left) Autocorrelogram and (right) climacogram of a time series of 100 terms generated from 
the FHK model with parameters as in the caption of Figure 3.7. (Source: Koutsoyiannis, 2016.) 


The alternative of using the periodogram (the estimate of the power spectrum, which is the 
Fourier transform of the autocovariance; see section 4.10) is even worse as it entails an even 
rougher shape and more uncertain estimation than in the autocovariance (see also section 4.10 
and Dimitriadis and Koutsoyiannis, 2015a). 

It is, thus, much preferable to directly use the climacogram instead of the autocorrelogram for 
model identification. For our example time series, this is illustrated in Figure 4.2(b), which 
indicates that the long-term persistence is well captured by the empirical climacogram and the 
parameter H is correctly estimated (H = 0.79, based on the method presented in Koutsoyiannis, 
2003, and Tyralis and Koutsoyiannis, 2011). Additional advantages of the climacogram are (a) its 
intactness on discretization, (b) its close relationship with entropy production and (c) its 
expandability to high-order moments. 
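
The empirical climacogram itself is simple to compute: average the series over non-overlapping windows of each scale and take the typical (divide-by-n) variance of the averages, as in equation (4.23). The sketch below is only an illustration; the function name `climacogram` is an assumption, incomplete trailing windows are dropped, and the mean is taken over the retained windows. No bias adjustment (equation (4.24)) is applied.

```python
import random


def climacogram(x, scales):
    """Typical variance of the time-averaged series at each integer scale kappa."""
    result = {}
    for kappa in scales:
        n_blocks = len(x) // kappa          # trailing incomplete window is dropped
        if n_blocks < 2:
            continue                        # a variance needs at least two averages
        blocks = [sum(x[i * kappa:(i + 1) * kappa]) / kappa for i in range(n_blocks)]
        mean = sum(blocks) / n_blocks
        result[kappa] = sum((b - mean) ** 2 for b in blocks) / n_blocks
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    white_noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    # For white noise the variance should fall roughly in proportion to 1/kappa.
    for kappa, variance in climacogram(white_noise, [1, 2, 5, 10, 20]).items():
        print(kappa, round(variance, 4))
```

Plotting the result on logarithmic axes gives the empirical curve of the kind shown in the right panel of Figure 4.2.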


4.8 Parameter estimation of distribution functions - The method of 
moments 


Assuming a stochastic variable $x$ with known distribution function but with unknown
parameters $\theta := [\theta_1, \theta_2, \ldots, \theta_m]^T$, we can denote the probability density function of $x$ as a
function $f(x, \theta)$. Here, we will examine the problem of the estimation of these parameters
based on a sample vector $x := [x_1, x_2, \ldots, x_n]^T$. Specifically, we will present the two most
popular methods in statistics, namely the moments method and the maximum likelihood
method. Several other general methods have been developed in statistics for parameter
estimation, e.g. the maximum entropy method (e.g. Singh and Rajagopal, 1986) and the
L-moments method (Hosking et al., 1985a,b; Hosking, 1990). Moreover, in practical
applications, other types of methods like graphical, tabulated, empirical and
semi-empirical, have been devised. As will be seen in later chapters, here we prefer a different
approach based on K-moments, over all above methods.

The method of moments is based on equating the theoretical moments of $x$ with the
corresponding sample estimates of noncentral moments. Thus, as $m$ is the number of the
unknown parameters of the distribution, we can write $m$ equations of the form

$$\mu'_p = \hat\mu'_p, \qquad p = 1, \ldots, m \tag{4.32}$$

where the theoretical moments $\mu'_p$ are functions of the unknown parameters given by:

$$\mu'_p = \int_{-\infty}^{\infty} x^p f(x, \theta)\,\mathrm{d}x \tag{4.33}$$

Thus, the solution of the resulting system of the $m$ equations gives the unknown
parameters $(\theta_1, \theta_2, \ldots, \theta_m)$. In general, the system of equations may not be linear and may
not have an analytical solution. In the latter case the system of equations will be solved
numerically.

This method is easy to apply. However, for distributions involving more than two 
parameters, the problem of knowability of moments intervenes and makes the method 
unreliable. Furthermore, when dealing with extremes we must have in mind that they are 
influenced by high-order moments and thus, relying on the lowest-order moment is not 
the best practice (see section 6.15). 


Digression 4.D: Illustration of the method of moments 


As an example of the implementation of the method of moments, we will determine the
parameters of the normal distribution. The probability density function:

$$f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

has two parameters, $\mu$ and $\sigma$. Thus, we need two equations. Based on Table 2.3, these equations
are:

$$\mu = \hat\mu, \qquad \sigma^2 + \mu^2 = \hat\mu_2 + \hat\mu^2 \;\Rightarrow\; \sigma^2 = \hat\mu_2$$

where we have used the identity $\mu'_2 = \mu_2 + \mu^2$. Consequently, the final estimates are

$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2$$

The estimation of $\mu$ is unbiased but that of $\sigma^2$ (and $\sigma$) is biased even in IID statistics (notice in
the latter equation that the result contains the typical, rather than the classical, estimate).

As we have seen in this example, the application of the method of moments is very simple and
this extends to other distribution functions.
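
The two moment equations of this digression translate directly into a few lines of code. This is a minimal sketch; the function name `normal_moments_fit` is hypothetical, and the guard against a slightly negative variance (floating-point rounding) is an addition made here.

```python
import math


def normal_moments_fit(sample):
    """Method-of-moments estimates (mu, sigma) of the normal distribution.

    Equates the first two noncentral sample moments with mu and sigma^2 + mu^2,
    which yields the typical (divide-by-n) variance estimate.
    """
    n = len(sample)
    m1 = sum(sample) / n                    # first noncentral sample moment
    m2 = sum(v * v for v in sample) / n     # second noncentral sample moment
    variance = max(m2 - m1 * m1, 0.0)       # guard against rounding below zero
    return m1, math.sqrt(variance)


if __name__ == "__main__":
    mu_hat, sigma_hat = normal_moments_fit([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    print(mu_hat, sigma_hat)
```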


4.9 Parameter estimation of distribution functions - The maximum 
likelihood method 


While the method of moments is a method of approximation and has several weaknesses,
as described in section 4.8, the method of maximum likelihood has a strong logical
background. We will initially present the method in a Bayesian framework and then we
will see that it stands also out of that framework.

The problem that we have to resolve is to find the parameter vector $\theta$ from the known
observations $\underline{x} = x$. Since the observations $x$ are known while the parameters $\theta$ are
unknown, we can regard the latter as stochastic variables $\underline{\theta}$ (underlined symbols denote
stochastic variables). This allows us to assign $\underline{\theta}$ a probability density function
$f_{\underline\theta}(\theta)$ and also express conditional densities by the Bayes
theorem (equation (2.14)). This can be written in terms of densities as:

$$f_{\underline\theta|\underline x}(\theta|x) = \frac{f_{\underline x|\underline\theta}(x|\theta)}{f_{\underline x}(x)}\, f_{\underline\theta}(\theta) \tag{4.34}$$

where we have replaced the events A and B with the vectors $\underline{x}$ and $\underline{\theta}$, respectively. The
terminology used in the Bayesian framework is:

• Prior (before observation) probability density for $f_{\underline\theta}(\theta)$

• Posterior (after observation) probability density for $f_{\underline\theta|\underline x}(\theta|x)$

• Likelihood for the conditional density $f_{\underline x|\underline\theta}(x|\theta)$; this is the hypothesized model (i.e.
distribution for $\underline{x}$) given the parameters $\theta$.


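According to this terminology, we can write (4.34) in the following form:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior} \tag{4.35}$$

Since we have to assign $\underline\theta$ a single value $\hat\theta$, the most rational choice for that value is the
mode of its distribution conditional on $\underline{x} = x$, i.e., the value that maximizes the posterior
$f_{\underline\theta|\underline x}(\theta|x)$. To find the mode we equate the derivative of the conditional density to 0, i.e.:

$$\frac{\mathrm{d}f_{\underline\theta|\underline x}(\theta|x)}{\mathrm{d}\theta}
= \frac{1}{f_{\underline x}(x)}\left(\frac{\mathrm{d}f_{\underline x|\underline\theta}(x|\theta)}{\mathrm{d}\theta}\, f_{\underline\theta}(\theta)
+ f_{\underline x|\underline\theta}(x|\theta)\,\frac{\mathrm{d}f_{\underline\theta}(\theta)}{\mathrm{d}\theta}\right) = 0 \tag{4.36}$$

Since we know nothing about the prior $f_{\underline\theta}(\theta)$, we can choose a so-called noninformative
prior, which does not change with $\theta$, i.e. $\mathrm{d}f_{\underline\theta}(\theta)/\mathrm{d}\theta = 0$. In this case from (4.36) we
obtain:

$$\frac{\mathrm{d}f_{\underline x|\underline\theta}(x|\theta)}{\mathrm{d}\theta} = 0 \tag{4.37}$$

which demands that also the likelihood be at maximum. In other words, we find $\hat\theta$,
demanding that the density $f_{\underline x|\underline\theta}(x|\theta)$ have a value as high as possible at the point $\underline{x} = x$.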


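If the vector $\underline{x}$ is part of a stochastic process, determination of $f_{\underline x|\underline\theta}(x|\theta)$ can be
laborious. However, in IID statistics, $\underline{x}$ is a sample vector with independent items and thus
the joint probability density function is:

$$f_{\underline x|\underline\theta}(x|\theta) = \prod_{i=1}^{n} f_{\underline x|\underline\theta}(x_i|\theta) \tag{4.38}$$

Thus, we seek a solution of:

$$\frac{\mathrm{d}\left[\prod_{i=1}^{n} f_{\underline x|\underline\theta}(x_i|\theta)\right]}{\mathrm{d}\theta} = 0 \tag{4.39}$$

We can also convert the product to a sum by taking the logarithm of $f_{\underline x|\underline\theta}(x|\theta)$:

$$L(x|\theta) := \ln f_{\underline x|\underline\theta}(x|\theta) = \sum_{i=1}^{n} \ln f_{\underline x|\underline\theta}(x_i|\theta) \tag{4.40}$$

The function $L(\,)$ is called the log-likelihood function. In this case, the condition of
maximum is:

$$\frac{\mathrm{d}L(x|\theta)}{\mathrm{d}\theta} = \sum_{i=1}^{n} \frac{1}{f_{\underline x|\underline\theta}(x_i|\theta)}\,\frac{\mathrm{d}f_{\underline x|\underline\theta}(x_i|\theta)}{\mathrm{d}\theta} = 0 \tag{4.41}$$

Both (4.39) and (4.41) are vector equations equivalent to $m$ scalar equations. Solving
either of them we obtain the values of the $m$ unknown parameters.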


Digression 4.E: Illustration of the maximum likelihood method 


We will determine the parameters of the normal distribution from a sample using the maximum
likelihood method. The probability density function of the normal distribution is:

$$f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The likelihood function is:

$$f(x_1, \ldots, x_n|\mu, \sigma) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right)$$

The log-likelihood function is:

$$L(x|\mu, \sigma) = -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

Taking the derivatives with respect to the unknown parameters $\mu$ and $\sigma$ and equating them to 0
we find

$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0, \qquad
\frac{\partial L}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$$

and solving the system we obtain the final parameter estimates:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2$$

The results are precisely identical with those of Digression 4.D, despite the fact that the two
methods are fundamentally different. The application of the maximum likelihood method is more
complex than that of the method of moments. The coincidence of results found here is not the rule
for all distribution functions. On the contrary, in most cases the two methods yield different
results.
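
The result of this digression can be checked numerically: the log-likelihood evaluated at the closed-form estimates should not be exceeded at nearby parameter values. The sketch below is an illustration under that idea; the function names are hypothetical.

```python
import math


def normal_log_likelihood(sample, mu, sigma):
    """Log-likelihood of an IID normal sample, as written in Digression 4.E."""
    n = len(sample)
    squares = sum((v - mu) ** 2 for v in sample)
    return -0.5 * n * math.log(2.0 * math.pi) - n * math.log(sigma) - squares / (2.0 * sigma ** 2)


def normal_ml_fit(sample):
    """Closed-form maximum likelihood estimates (mu, sigma)."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / n)
    return mu, sigma


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    mu_hat, sigma_hat = normal_ml_fit(data)
    best = normal_log_likelihood(data, mu_hat, sigma_hat)
    # Perturbing either parameter should lower the log-likelihood.
    for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
        print(best > normal_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma))
```

For distributions without closed-form estimates the same log-likelihood function would be handed to a numerical optimizer instead.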
4.10 The estimation of power spectrum and the periodogram


We assume that a stochastic process $x(t)$ is observed on a time-average basis at
equidistant times $\tau D$, $\tau = 0, \ldots, n - 1$, where $D$ is a time step (a total observation time $L =
nD$). We have thus a time series with a finite number, $n$, of observations $x_\tau$ of the discrete-
time process $\underline{x}_\tau$. If we study the process on the frequency domain, we have the following
characteristic frequencies, dimensional ($w$) or dimensionless ($\omega = wD$):

Characteristic frequency      Dimensional                 Dimensionless
Sampling frequency            $w_D = 1/D = n/L$           $\omega_D = w_D D = 1$
Nyquist frequency             $w_N = 1/2D = n/2L$         $\omega_N = w_N D = 0.5$
Frequency resolution          $w_1 = 1/L = w_D/n$         $\omega_1 = w_1 D = D/L = 1/n$
Half frequency resolution     $w_2 = 1/2L = w_D/2n$       $\omega_2 = w_2 D = D/2L = 1/2n$

As we will see, the Nyquist frequency ($\omega_N = 0.5$) is the maximum frequency on which we
can make estimates as beyond that the resulting spectrum estimates are repeated in a
cyclic manner.

We are interested in estimators of the power spectrum of the discrete-time process $\underline{x}_\tau$.
A first estimator can be established by utilizing the relationship between the power
spectrum and the autocovariance function (equation (3.35)). From $n$ observations we can
estimate from equation (4.27) up to $n$ autocovariance terms, $\hat c_0, \hat c_1, \ldots, \hat c_{n-1}$ (noting that
most of them will not be reliably estimated). Then, by truncating equation (3.35) to a finite
number of terms we can formulate an estimator of the spectrum in the form:

$$\hat s_d(\omega) = 2\hat c_0 + 4\sum_{\eta=1}^{n-1}\hat c_\eta\cos(2\pi\eta\omega) + 2\hat c_n\cos(2\pi n\omega) \tag{4.42}$$

where we have put a last term for $\hat c_n$ with a weight 2 (instead of 4), which, as we will see,
facilitates and accelerates calculations. If we have $n$ data values $x_\tau$, then $\hat c_n = 0$, but the
calculation should stand in cases where we use a smaller number of autocorrelations or in
cases where we process true values rather than estimates (in the latter case, $c_n \neq 0$).
While at first glance we can use this equation to estimate $\hat s_d(\omega)$ for any $\omega$, the resulting
values are not always consistent and therefore it is advisable to make estimates for a finite
number of discrete frequencies $\omega_j = j\omega_0$, where $\omega_0$ is either $\omega_1$ or $\omega_2$, with $j$ taking integer
values as we will specify below.
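
Before turning to the FFT route, equation (4.42) can be evaluated directly as a cosine sum. The following sketch does so with the typical autocovariance estimator (4.27) and takes $\hat c_n = 0$, which is the case when all $n$ data values are used; the function names are illustrative assumptions, and the direct sum is slow (order $n^2$) compared with the FFT.

```python
import math


def autocovariance(x, eta):
    """Typical (divide-by-n) lag-eta autocovariance estimate, equation (4.27)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + eta] - mean) for t in range(n - eta)) / n


def spectrum_from_autocovariance(x, omega):
    """Spectrum estimate of equation (4.42) at dimensionless frequency omega.

    The last term of (4.42) is omitted because c_n = 0 for n data values.
    """
    n = len(x)
    c = [autocovariance(x, eta) for eta in range(n)]
    return 2.0 * c[0] + 4.0 * sum(
        c[eta] * math.cos(2.0 * math.pi * eta * omega) for eta in range(1, n)
    )


if __name__ == "__main__":
    series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
    n = len(series)
    for j in range(0, n // 2 + 1):
        print(j / n, round(spectrum_from_autocovariance(series, j / n), 6))
```

A quick sanity check on any implementation: with the sample mean removed, the typical autocovariance estimates over all lags sum to zero, so the estimate at zero frequency vanishes up to rounding.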

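The inversion of the formula to find the autocovariance estimates from the power
spectrum estimates is possible through the equation:

$$\hat c_\eta = \omega_0\left(\frac{\hat s_d(0) + \hat s_d(0.5)\cos(\pi\eta)}{2} + \sum_{0<\omega_j<0.5}\hat s_d(\omega_j)\cos(2\pi\eta\,\omega_j)\right) \tag{4.43}$$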
The estimation of $\hat s_d(\omega)$ is streamlined and accelerated if we use the discrete Fourier
transform (DFT) and particularly its variant named fast Fourier transform (FFT), for which
the required software exists on all computational environments. For a sequence of
numbers $x_\tau$, $\tau = 0, \ldots, N - 1$, the DFT is defined as a sequence $u_j$, $j = 0, \ldots, N - 1$, where:

$$u_j = \frac{1}{N}\sum_{\tau=0}^{N-1} x_\tau\, e^{-2\pi i\,\tau j/N}, \qquad j = 0, \ldots, N-1 \tag{4.44}$$

The sequence $x_\tau$ is recovered from the sequence $u_j$ by the inverse DFT, which is:

$$x_\tau = \sum_{j=0}^{N-1} u_j\, e^{2\pi i\,\tau j/N}, \qquad \tau = 0, \ldots, N-1 \tag{4.45}$$

The FFT is the DFT made by a fast computational algorithm; the fastest case is when the
sequence length is a power of 2.
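To utilize DFT and FFT in determining $\hat s_d(\omega)$ we write equation (4.42) as:

$$\hat s_d(\omega) = \sum_{\eta=0}^{n} 2\hat c_\eta\cos(2\pi\eta\omega) + \sum_{\eta=1}^{n-1} 2\hat c_\eta\cos(2\pi\eta\omega) \tag{4.46}$$

Setting $j = \eta$ for the first sum and $j = 2n - \eta$ for the second sum we have:

$$\hat s_d(\omega) = \sum_{j=0}^{n} 2\hat c_j\cos(2\pi j\omega) + \sum_{j=n+1}^{2n-1} 2\hat c_{2n-j}\cos(2\pi(2n-j)\omega) \tag{4.47}$$

If $\omega$ is an integer multiple of $\omega_2 = 1/N$ where $N := 2n$, then $2n\omega$ will be an integer and
thus $\cos(2\pi(2n - j)\omega) = \cos(2\pi j\omega)$. By setting:

$$u_j := \begin{cases} 2\hat c_j, & 0 \le j \le n\\ 2\hat c_{N-j}, & n < j \le N-1 \end{cases} \tag{4.48}$$

we can simplify (4.47) to:

$$\hat s_d(\omega) = \sum_{j=0}^{N-1} u_j\cos(2\pi j\omega) \tag{4.49}$$

Considering that the imaginary part of $u_j$ is zero, setting $\omega_\tau = \tau/N$, and comparing
equations (4.45) and (4.49), we conclude that $\hat s_d(\omega_\tau)$ is the inverse DFT of $u_j$. If we have
taken care to choose $n$ a power of 2, $N$ will also be a power of 2 and thus we can use the
inverse FFT to calculate estimates $\hat s_d(\omega_\tau)$ from estimates $\hat c_\eta$ for frequencies $\omega$ ranging
from 0 to 0.5 with a resolution $\omega_2 = 1/N = 1/2n$. The inverse of (4.49) is:

$$u_j = 2\hat c_j = \frac{1}{N}\sum_{\tau=0}^{N-1}\hat s_d(\omega_\tau)\cos(2\pi j\,\omega_\tau), \qquad 0 \le j \le n \tag{4.50}$$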

There is an alternative way to produce another estimator of the power spectrum using
the DFT on the discrete-time process per se, rather than on its autocovariance.
Specifically, the DFT of $x_\tau$ is:

$$u_j = \frac{1}{n}\sum_{\tau=0}^{n-1} x_\tau\, e^{-2\pi i\,\tau j/n}, \qquad j = 0, \ldots, n-1 \tag{4.51}$$

Assuming that $x_\tau$, $\tau = 0, \ldots, n - 1$, are real-valued stochastic variables, their
transformation $u_j$, $j = 0, \ldots, n - 1$, will be complex-valued stochastic variables, i.e. $u_j =
u_j^R + i\,u_j^I$, where $u_j^R$ and $u_j^I$ are real-valued. The inverse DFT of $u_j$ recovers the real-valued
$x_\tau$. The sequence of the squared absolute values of $u_j$ multiplied by $2n$:

$$S_j := 2n\,|u_j|^2 = 2n\left((u_j^R)^2 + (u_j^I)^2\right) \tag{4.52}$$

is real valued and, as a function of $\omega_j = j/n$, is known as the periodogram of $x_\tau$. It is
another estimator of $s_d(\omega)$ with a resolution $\omega_1 = 1/n$ (while in the estimator (4.49) this
is $\omega_2 = 1/2n$). The two alternatives of estimating the power spectrum are schematically
presented in Figure 4.3.
Figure 4.3 Schematic of the different paths to estimate the power spectrum.

For real-valued $x_\tau$ the stochastic variables $u_j$ and $S_j$ have the following properties of
symmetry:

$$\begin{aligned}
&u_0^R = u_0 = \hat\mu, \qquad u_0^I = 0\\
&u_{n-j}^R = u_j^R, \qquad u_{n-j}^I = -u_j^I, \qquad S_{n-j} = S_j, \qquad 1 \le j \le n-1
\end{aligned} \tag{4.53}$$

In other words, the real component of $u_j$ and $S_j$ are symmetric with respect to $n/2$, while
the imaginary component is antisymmetric. Consequently, if $n$ is even, then $u_{n/2}^I = 0$.
Because of the symmetries, starting with $n$ real numbers $x_\tau$ we end up with $n$ real
numbers $u_j^R$ and $u_j^I$, and $n/2$ real numbers $S_j$. The values of $S_j$ for frequencies $\omega = j/n \le
0.5$ provide all extractable information while larger frequencies do not add anything of
value.
Other interesting properties of the periodogram and the related quantities are:

$$\frac{1}{n}\sum_{\tau=0}^{n-1} x_\tau^2 = \sum_{j=0}^{n-1}|u_j|^2, \qquad
\hat\gamma_1 = \frac{1}{n}\sum_{\tau=0}^{n-1}(x_\tau - \hat\mu)^2 = \sum_{j=1}^{n-1}|u_j|^2
= \frac{1}{n}\sum_{1\le j\le n/2} S_j - \frac{S_{n/2}}{2n} \tag{4.54}$$

where if $n$ is odd, the last term $S_{n/2}$ is set to zero. The latter equation allows decomposing
the variance estimate $\hat\gamma_1$ into partial components $|u_j|^2$, each corresponding to a particular
frequency, which ranges from $\omega_1 = w_1 D = 1/n$ to $\omega_N = w_N D = 0.5$. The frequency 0
corresponds to the mean estimate and is not related to the variance. Any prominence
(peak) in one or more $|u_j|^2$ over the others is very often regarded as evidence of a periodic
behaviour of the process with a frequency $j/n$ (period $n/j$).
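
The periodogram (4.52) and the variance decomposition (4.54) can be verified on a short series. The sketch below uses a naive DFT with the 1/n normalisation of equation (4.51), written out with the standard library so that the normalisation is explicit; a production implementation would use an FFT routine instead. All function names are illustrative assumptions.

```python
import cmath


def dft(x):
    """DFT with the 1/n normalisation of equation (4.51)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * j / n) for t in range(n)) / n
        for j in range(n)
    ]


def periodogram(x):
    """S_j = 2 n |u_j|^2 for j = 0, ..., floor(n/2), equation (4.52)."""
    n = len(x)
    u = dft(x)
    return [2 * n * abs(u[j]) ** 2 for j in range(n // 2 + 1)]


def variance_from_periodogram(x):
    """Right-hand side of equation (4.54); S_0 (the mean term) is excluded."""
    n = len(x)
    s = periodogram(x)
    total = sum(s[1:]) / n
    if n % 2 == 0:
        total -= s[n // 2] / (2 * n)   # the Nyquist term is counted only once
    return total


def typical_variance(x):
    """Typical (divide-by-n) variance estimate, equation (4.18)."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / n


if __name__ == "__main__":
    series = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    print(typical_variance(series), variance_from_periodogram(series))
```

The two printed values agree up to rounding, for even as well as odd n.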

However, claims of periodicities without a deterministic explanation are usually
meaningless. As evident from the notation in the entire section, all related concepts,
including the periodogram, are estimators, i.e. stochastic variables, which produce
estimates. Considered as a sequence of stochastic variables, the periodogram $S_j$ is a
nonstationary stochastic process indexed by $j = 1, \ldots, \lfloor n/2\rfloor$. The same happens with the
estimator $\hat s_d(\omega_j)$, which is a nonstationary stochastic process indexed by $j = 1, \ldots, n$, as
well as with the covariance estimator $\hat c_\eta$. The produced shapes in graphs of estimates
indicate high variability and roughness, and thus possible peaks are most probably
random effects. Note that by increasing the number of observations, the variability and
roughness do not necessarily decrease (cf. (4.52), where $|u_j|^2$ is multiplied by $2n$).

An illustration is given in Figure 4.4 for a time series generated from the discrete-time 
HK process, where several peaks appear, all of which are random effects. A simple 
technique to see that these are random effects is to split the time series into two halves, 
three thirds, etc. and inspect whether the peaks appear systematically in all cases 
(Koutsoyiannis and Georgakakos, 2006). Splitting the time series and taking the average 
of the different parts for the same frequency is a method of smoothing the periodogram 
(for details and other smoothing methods see Papoulis, 1991). The least-squares trend
(power law) of the spectrum estimates from autocovariance is also shown in the log-log
spectrum plot of Figure 4.4 (bottom-right). The slope is −1.24, an inconsistent value as
theoretically the slope cannot be steeper than −1 (the slope of the theoretical curve, also
shown in the figure, is 1 − 2H = −0.6 > −1). This inconsistency is not expected to be
resolved by the aforementioned smoothing of the power spectrum. For these reasons, the 
use of the climacospectrum, over the power spectrum, is recommended for estimation of 
slopes (Koutsoyiannis, 2017). 


4.11 Interval estimation and confidence intervals 


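An interval estimate of a parameter $\lambda$ of a distribution function is an interval of the form
$(\hat\theta_1, \hat\theta_2)$, where $\hat\theta_1$ and $\hat\theta_2$ are functions of the observed sample vector $x$, i.e., $\hat\theta_1 = g_1(x)$
and $\hat\theta_2 = g_2(x)$. If we replace the observed sample with the sample $\underline{x}$ (or the part of a
stochastic process), then the interval’s limits become stochastic variables, denoted by
underlined symbols, $\underline{\hat\theta}_1 = g_1(\underline{x})$ and
$\underline{\hat\theta}_2 = g_2(\underline{x})$. The interval $(\underline{\hat\theta}_1, \underline{\hat\theta}_2)$ is an interval estimator of the parameter $\lambda$.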
Figure 4.4 (upper) A plot of a time series with $n = 512$ terms generated from the Gaussian HK
model with $H = 0.8$, $\mu = 100$, $\gamma_1 = 400$. (middle) The autocovariance and power spectrum of the
generating stochastic process and their estimates. (lower) Same as middle but with logarithmic
axes. The least-squares trend (power law) of the estimates from autocovariance, with slope =
−1.24, is also plotted in the spectrum panel.


We say that the interval $(\underline{\hat\theta}_1, \underline{\hat\theta}_2)$ is a $C$-confidence interval of the parameter $\lambda$ if:

$$P\{\underline{\hat\theta}_1 < \lambda < \underline{\hat\theta}_2\} = C \tag{4.55}$$

where $C$ is a given constant ($0 < C < 1$) called the confidence coefficient, and the limits
$\underline{\hat\theta}_1$, $\underline{\hat\theta}_2$ are called $C$-confidence limits. Usually, we choose values of $C$ near 1 (e.g. 0.9, 0.95,
0.99), so that the inequality in (4.55) becomes near certainty. In practice the term
confidence limits is often (loosely) used to describe the numerical values of the statistics
$\hat\theta_1$, $\hat\theta_2$, whereas the same happens for the term confidence interval.

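In order to provide a method for the calculation of a confidence interval, we will assume
that the statistic $\underline{\hat\theta} = g(\underline{x})$ is a point estimator of the parameter $\lambda$ with distribution
function $F_{\underline{\hat\theta}}(\theta)$. Based on this distribution function it is possible to calculate two positive
numbers $\varepsilon_1$ and $\varepsilon_2$, so that the estimation error $\underline{\hat\theta} - \lambda$ lie in the interval $(-\varepsilon_1, \varepsilon_2)$ with
probability $C$, i.e.:

$$P\{\lambda - \varepsilon_1 < \underline{\hat\theta} < \lambda + \varepsilon_2\} = C \tag{4.56}$$

and at the same time the interval $(-\varepsilon_1, \varepsilon_2)$ be as small as possible. If the distribution of $\underline{\hat\theta}$ is
symmetric then the interval $(-\varepsilon_1, \varepsilon_2)$ has minimum length for $\varepsilon_1 = \varepsilon_2$. For asymmetric
distributions, it is difficult to calculate the minimum interval, thus we simplify the
problem by splitting (4.56) into two equations, namely, $P\{\underline{\hat\theta} < \lambda - \varepsilon_1\} = P\{\underline{\hat\theta} > \lambda +
\varepsilon_2\} = (1 - C)/2$. Equation (4.56) can be written as:

$$P\{\underline{\hat\theta} - \varepsilon_2 < \lambda < \underline{\hat\theta} + \varepsilon_1\} = C \tag{4.57}$$

Consequently, the confidence limits we are seeking are $\hat\theta_1 = \hat\theta - \varepsilon_2$ and $\hat\theta_2 = \hat\theta + \varepsilon_1$.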

Although equations (4.56) and (4.57) are equivalent, their statistical interpretations
differ. The former is a prediction, i.e., it gives the prediction interval* of the stochastic
variable $\underline{\hat\theta}$. The latter is an interval parameter estimator, i.e., it gives the confidence limits
of the unknown parameter $\lambda$, which is not a stochastic variable.

Classical statistical texts provide expressions for interval estimators of some common 
parameters, such as the mean and variance of the normal distribution of IID samples. 
However, in most real-world cases we deal with problems much more demanding than 
such idealized cases. The distributions may be non-normal, the parameter of interest may 
not be the mean or the variance, and instead of a sample we may have a stochastic process.
Then analytical calculation of confidence limits becomes impossible. Naturally, the 
method of choice for such (that is, most) cases is the Monte Carlo simulation. General 
methodologies for tackling the problem have been proposed by Tyralis et al. (2013) and 
Tyralis and Koutsoyiannis (2014). 
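
To make the Monte Carlo idea concrete, the following sketch estimates $\varepsilon_1$ and $\varepsilon_2$ of equation (4.56) for the mean estimator by simulation, using the equal-tail simplification described above. It is not the methodology of the cited papers. A Gaussian AR(1) process is assumed purely for illustration, and the function names, the seed and the number of simulations are arbitrary choices made here.

```python
import random


def simulate_ar1(n, mu, sigma, rho, rng):
    """One realisation of a stationary Gaussian AR(1) process (illustrative model)."""
    x = [mu + sigma * rng.gauss(0.0, 1.0)]
    innovation_sd = sigma * (1.0 - rho ** 2) ** 0.5
    for _ in range(n - 1):
        x.append(mu + rho * (x[-1] - mu) + innovation_sd * rng.gauss(0.0, 1.0))
    return x


def monte_carlo_mean_errors(n, sigma, rho, conf=0.95, n_sim=2000, seed=1):
    """Equal-tail (eps1, eps2) of equation (4.56) for the sample mean, with mu = 0."""
    rng = random.Random(seed)
    means = sorted(sum(simulate_ar1(n, 0.0, sigma, rho, rng)) / n for _ in range(n_sim))
    lower = means[int((1.0 - conf) / 2.0 * n_sim)]
    upper = means[int((1.0 + conf) / 2.0 * n_sim) - 1]
    return -lower, upper   # estimation error lies in (-eps1, eps2) with probability conf


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        eps1, eps2 = monte_carlo_mean_errors(n=100, sigma=1.0, rho=rho)
        observed_mean = 10.0   # hypothetical observed estimate
        # Confidence limits from equation (4.57): (estimate - eps2, estimate + eps1)
        print(rho, observed_mean - eps2, observed_mean + eps1)
```

With dependence (rho = 0.8) the interval comes out markedly wider than for the independent case, which is the effect quantified by the equivalent sample size of equation (4.16).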


4.12 Order statistics 


Let $x$ be a stochastic variable and $x_1, x_2, \ldots, x_n$ be IID copies of it, forming a sample. We can
rearrange them in increasing order of magnitude such that $x_{(i:n)}$ be the $i$th smallest of the
$n$, i.e.:

$$x_{(1:n)} \le x_{(2:n)} \le \cdots \le x_{(n:n)} \tag{4.58}$$

The stochastic variable $x_{(i:n)}$ is termed the $i$th order statistic. The minimum and maximum
are, respectively,


$$x_{(1:n)} = \min(x_1, x_2, \ldots, x_n), \qquad x_{(n:n)} = \max(x_1, x_2, \ldots, x_n) \tag{4.59}$$

and represent special cases of the order statistics, the lowest and the highest.

* The terms confidence limits, confidence interval, confidence coefficient etc. are also used for this
prediction form of the equation.
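If $f(x)$ and $F(x)$ are respectively the density and distribution function of $x$, then the
density function of $y := x_{(i:n)}$ is (Papoulis 1990):

$$f_y(x) = f_{(i:n)}(x) = (n-i+1)\binom{n}{i-1}\left(F(x)\right)^{i-1}\left(1-F(x)\right)^{n-i} f(x) \tag{4.60}$$

Now if we define the stochastic variable $u := F(y) = F(x_{(i:n)})$, then according to (2.11):

$$f_u(u) = \frac{f_y\left(F^{-1}(u)\right)}{f\left(F^{-1}(u)\right)} = (n-i+1)\binom{n}{i-1}u^{i-1}(1-u)^{n-i}
= \frac{u^{i-1}(1-u)^{n-i}}{\mathrm{B}(i, n-i+1)} \tag{4.61}$$

This is the density of the Beta distribution function and hence:

$$F_{(i:n)}(x) = P\{\underline{x}_{(i:n)} \le x\} = P\{\underline{u} \le F(x)\} = \frac{\mathrm{B}_{F(x)}(i, n-i+1)}{\mathrm{B}(i, n-i+1)} \tag{4.62}$$

For the special cases of the minimum and maximum we have, respectively,

$$F_{(1:n)}(x) = \frac{\mathrm{B}_{F(x)}(1, n)}{\mathrm{B}(1, n)} = 1-\left(1-F(x)\right)^n, \qquad
F_{(n:n)}(x) = \frac{\mathrm{B}_{F(x)}(n, 1)}{\mathrm{B}(n, 1)} = \left(F(x)\right)^n \tag{4.63}$$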

As we will see in Chapter 5 and Chapter 6, the order statistics are quite important for 
studying extremes. 
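
For integer $i$ and $n$ the incomplete Beta ratio in (4.62) can be evaluated without special functions, because the event that the $i$th order statistic does not exceed $x$ means that at least $i$ of the $n$ IID values do not exceed $x$, which is a binomial tail probability. The sketch below relies on this standard identity; the function name is an assumption.

```python
from math import comb


def order_statistic_cdf(big_f: float, i: int, n: int) -> float:
    """Distribution function of the ith order statistic, given big_f = F(x).

    Uses the binomial-tail form: at least i of the n values are <= x.
    This equals the regularized incomplete Beta function of equation (4.62).
    """
    if not 1 <= i <= n:
        raise ValueError("i must satisfy 1 <= i <= n")
    return sum(comb(n, j) * big_f ** j * (1.0 - big_f) ** (n - j) for j in range(i, n + 1))


if __name__ == "__main__":
    big_f, n = 0.3, 5
    print(order_statistic_cdf(big_f, 1, n), 1.0 - (1.0 - big_f) ** n)   # minimum, (4.63)
    print(order_statistic_cdf(big_f, n, n), big_f ** n)                 # maximum, (4.63)
```

The two printed pairs coincide up to rounding, reproducing the special cases (4.63).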


4.13 Samples vs. time series and forecast-oriented estimation 


As we have seen, in classical statistics, samples are by definition sets of IID stochastic 
variables. Classical statistical estimations make use of the entire vector of available 
observations. But what if instead of a sample we have a stochastic process with time 
dependence and instead of an observed sample we have a time series? Apparently, things 
can be quite different and generally we should avoid uncritical use of classical statistics. 
To illustrate the difference, we consider the following problem: How many past terms will
we use to estimate an average that is representative of the future mean for a period of
length $\kappa$? This is not necessarily the “global” average estimated by the entire time series
of observations.

The statistic sought is the “local” mean of the future period of length $\kappa$ conditional on
the present and past, i.e.:

$$\mu_\kappa := E\left[\frac{1}{\kappa}(x_1 + \cdots + x_\kappa)\,\Big|\,x_0, x_{-1}, \ldots\right] \tag{4.64}$$

Let us assume that we have a large number $n$ of observations of the present and past but
we choose to use $\nu \le n$ of them for the estimation:

$$\hat\mu_\nu := \frac{1}{\nu}(x_0 + x_{-1} + \cdots + x_{-\nu+1}) \tag{4.65}$$
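To answer the question, it suffices to find $\nu$ which minimizes the mean square error
$A(\kappa, \nu) := E\left[(\hat\mu_\nu - \mu_\kappa)^2\right]$, which can be written as:

$$\begin{aligned}
A(\kappa, \nu) &= E\left[\left(\frac{1}{\kappa}(x_1 + \cdots + x_\kappa) - \frac{1}{\nu}(x_0 + x_{-1} + \cdots + x_{-\nu+1})\right)^2\right]\\
&= E\left[\left(\frac{X_{-\nu}}{\nu} - \frac{X_0}{\nu} - \frac{X_0}{\kappa} + \frac{X_\kappa}{\kappa}\right)^2\right]
\end{aligned} \tag{4.66}$$

where $X_\tau$ denotes the cumulative process, so that $x_1 + \cdots + x_\kappa = X_\kappa - X_0$ and
$x_0 + \cdots + x_{-\nu+1} = X_0 - X_{-\nu}$. As demonstrated in Appendix 4-I, this is expressed in terms of
the climacogram as:

$$A(\kappa, \nu) = \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\left(\kappa\gamma(\kappa) + \nu\gamma(\nu)\right) - \frac{(\nu+\kappa)^2}{\nu\kappa}\,\gamma(\nu+\kappa) \tag{4.67}$$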

We will discuss now a few examples. First is the Hurst-Kolmogorov process, for which
$\gamma(\kappa) = \lambda(\kappa/\alpha)^{2H-2}$. As explained in Appendix 4-I, the value of $\nu$ that minimizes $A$ is:

$$\nu = \frac{\kappa}{\left(\max(0,\, 2.5H - 1.5)\right)^{2.5}} \tag{4.68}$$

If $H \le 0.6$, this yields $\nu = \infty$, which means that the future mean estimate is the average of
the entire set of $n$ observations, the global mean. However, if $H > 0.6$, then it can be $\nu < n$
and, hence, we should use a local mean with fewer terms than $n$ to estimate the future
mean. As $H \to 1$, $\nu/\kappa \to 1$, too. A graphical illustration of equation (4.68) is given in Figure
4.5 (left).

We recall, though, that the Hurst-Kolmogorov process entails infinite instantaneous
variance and thus it is not an ideal model for real-world processes. The second example
is a filtered Hurst-Kolmogorov process in its simplest Cauchy form (FHK-C) with $M = 1/2$,
i.e., $\gamma(\kappa) = \lambda(1 + \kappa/\alpha)^{2H-2}$. This has finite instantaneous variance, equal to $\lambda$. Studying
this process and in particular considering the specific values $A(\kappa, 1)$, $A(\kappa, 2)$, as given by
(4.67), and $A(\kappa, \infty) = \gamma(\kappa)$, we will see that there are cases where:

$$A(\kappa, 1) < \min\left(A(\kappa, 2),\, A(\kappa, \infty)\right) \tag{4.69}$$

In such cases the resulting optimal $\nu$ equals one, which means that only the present value
should count for the future mean. A systematic numerical investigation on equation (4.69)
suggested that $\nu = 1$ is optimal when:

$$\kappa < \kappa_1 \approx 2.3(\alpha + 1)H^2 - 1 \tag{4.70}$$

Combining the above results, we find that an approximate general solution for the above
FHK-C model is:

$$\nu = \max\left(1,\, \frac{\kappa - \kappa_1}{\left(\max(0,\, 2.5H - 1.5)\right)^{2.5}}\right)
= \max\left(1,\, \frac{\kappa - 2.3(\alpha+1)H^2 + 1}{\left(\max(0,\, 2.5H - 1.5)\right)^{2.5}}\right) \tag{4.71}$$


Characteristic results are given graphically in Figure 4.5 (right). It can be seen that the
result $\nu = 1$ is not uncommon as it appears for many parameter combinations. More
generally, finite values of $\nu$ of the order of $\kappa$ or somewhat larger are common for $\kappa \le 10$
(for example for $H = 0.75$ and $\alpha = 10$, the optimal $\nu$ is 1 for $\kappa = 10$ and increases to $\nu = 20$
for $\kappa = 15$). The case $H = 0.5$ is virtually equivalent to a Markov process. As shown in Figure
4.5 (right) for this case (particularly for $\alpha = 10$), the plot is a vertical line at $\kappa = \kappa_1$ and this
means the optimal value is either $\nu = 1$ (for $\kappa < \kappa_1$) or $\nu = n$ (for $\kappa > \kappa_1$). In order for this to
happen, $\kappa_1$ must be $\ge 1$, which happens when $\alpha \ge 2.5$ (otherwise, $\nu = n$ for any $\kappa$).
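
Equations (4.67), (4.68) and (4.71) are straightforward to evaluate. The sketch below codes them as written; the function names and default parameter values are assumptions made for illustration, and an infinite return value stands for "use all n observations". The boundary case $\kappa = \kappa_1$ is assigned $\nu = 1$ here as a convention.

```python
import math


def gamma_hk(k, lam=1.0, alpha=1.0, hurst=0.8):
    """HK climacogram, gamma(k) = lam * (k / alpha)**(2H - 2)."""
    return lam * (k / alpha) ** (2.0 * hurst - 2.0)


def mse_local_mean(kappa, nu, gamma):
    """Mean square error A(kappa, nu) of equation (4.67) for a climacogram gamma(k)."""
    return (1.0 / kappa + 1.0 / nu) * (kappa * gamma(kappa) + nu * gamma(nu)) \
        - (nu + kappa) ** 2 / (nu * kappa) * gamma(nu + kappa)


def optimal_nu_hk(kappa, hurst):
    """Optimal number of past terms for the HK process, equation (4.68)."""
    d = max(0.0, 2.5 * hurst - 1.5)
    return math.inf if d == 0.0 else kappa / d ** 2.5


def optimal_nu_fhkc(kappa, hurst, alpha):
    """Approximate optimal number of past terms for the FHK-C model, equation (4.71)."""
    kappa1 = 2.3 * (alpha + 1.0) * hurst ** 2 - 1.0      # equation (4.70)
    if kappa <= kappa1:
        return 1.0
    d = max(0.0, 2.5 * hurst - 1.5)
    return math.inf if d == 0.0 else max(1.0, (kappa - kappa1) / d ** 2.5)


if __name__ == "__main__":
    print(optimal_nu_fhkc(10, 0.75, 10), optimal_nu_fhkc(15, 0.75, 10))
    print(mse_local_mean(4, 4, lambda k: gamma_hk(k, hurst=0.5)))
```

For H = 0.75 and α = 10 the function gives ν = 1 at κ = 10 and a value between 20 and 21 at κ = 15, matching the example in the text. For white noise (H = 0.5, λ = α = 1) the error A(4, 4) reduces to 1/κ + 1/ν = 0.5, as expected for independent terms.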

Note that here we considered the question: Which of the local past averages is most 
representative as an estimate of the future average? We did not consider weighted 
averages of past values, even though this could reduce estimation variance. Therefore, the 
case where the resulting optimal value is $\nu = 1$ does not suggest that the process is a
martingale*. This analysis aims to show the differences of global and local time averages
and the fact that the latter may provide better prediction for the future in processes with 
dependence. A detailed study on the subject using real-world (rainfall) data has been 
made by Iliopoulou and Koutsoyiannis (2020). An illustration using weighted rather than 
standard averages is given in Digression 4.F. 
Figure 4.5 (left) Graphical illustration of equation (4.68) assuming a large length $n$ of the time
series. (right) Characteristic curves of optimal $\nu$ vs $\kappa$ for the indicated values of parameters $H$ and
$\alpha$ as found by numerical analysis (and approximated by equation (4.71)); continuous, dotted and
dashed lines correspond to $\alpha$ = 0.1, 1 and 10, respectively (for the case $H = 0.5$ the curves for $\alpha$ =
0.1 and 1 fall out of the graph and therefore only that for $\alpha = 10$ is shown, which is a vertical
straight line).


Digression 4.F: Forecast-oriented estimation using weights 


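If we use weights $b_i$ to estimate a weighted average of past values which will be representative of 
the future average, then equation (4.66) is replaced by: 

$$A(\boldsymbol{b}, \kappa, \nu) = \mathrm{E}\left[\left(\frac{1}{\kappa}\left(x_1 + \cdots + x_\kappa\right) - \left(b_1 x_0 + b_2 x_{-1} + \cdots + b_\nu x_{-\nu+1}\right)\right)^2\right] \tag{4.72}$$

Assuming that $b_1 + \cdots + b_\nu = 1$, and setting $y := (x_1 + \cdots + x_\kappa)/\kappa$, $w_i := y - x_{1-i}$, we can write: 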





* A martingale is a stochastic process in which $\mathrm{E}[x_1 \mid x_0, x_{-1}, \ldots] = x_0$. 
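$$A(\boldsymbol{b}, \kappa, \nu) = \mathrm{E}\left[\left(b_1(y - x_0) + \cdots + b_\nu(y - x_{-\nu+1})\right)^2\right] \tag{4.73}$$

or in vector form: 

$$A(\boldsymbol{b}, \kappa, \nu) = \mathrm{E}\left[(\boldsymbol{b}^{\mathrm T}\boldsymbol{w})^2\right] = \mathrm{E}\left[\boldsymbol{b}^{\mathrm T}\boldsymbol{w}\boldsymbol{w}^{\mathrm T}\boldsymbol{b}\right] = \boldsymbol{b}^{\mathrm T}\boldsymbol{C}\boldsymbol{b} \tag{4.74}$$

where $\boldsymbol{b} := [b_1, \ldots, b_\nu]^{\mathrm T}$, $\boldsymbol{w} := [w_1, \ldots, w_\nu]^{\mathrm T}$ and $\boldsymbol{C} := \mathrm{E}[\boldsymbol{w}\boldsymbol{w}^{\mathrm T}]$. The elements of $\boldsymbol{C}$ are: 

$$C_{ij} = \mathrm{E}[w_i w_j] = \mathrm{E}[(y - x_{1-i})(y - x_{1-j})] = \mathrm{E}[y^2] - \mathrm{E}[x_{1-i}\, y] - \mathrm{E}[x_{1-j}\, y] + \mathrm{E}[x_{1-i}\, x_{1-j}] \tag{4.75}$$

The first term of the sum (assuming zero mean, without loss of generality) is $\mathrm{E}[y^2] = \gamma(\kappa) = \Gamma(\kappa)/\kappa^2$. The last term is: 

$$\mathrm{E}[x_{1-i}\, x_{1-j}] = c_{|i-j|} = \frac{1}{2}\left(\Gamma(|i - j + 1|) + \Gamma(|i - j - 1|)\right) - \Gamma(|i - j|) \tag{4.76}$$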


To find the middle terms of the sum we observe that X,,_, = X, — Xx, and hence: 


Be eee = Bea Oe Dy 


(5) 2 | 5 5 (4.77) 
while because of symmetry E[X Xx | = E| XX; |: Thus, 
E[x y]|=— E[xi(Xy+x eg Xx. whe Ez (Xan i417 4v- i+1)] 
(4.78) 


il 
=O et) Sie et) iy 2) 
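Once we have calculated the elements of the symmetric matrix $\boldsymbol{C}$ as above, what remains is to: 

$$\text{minimize } A'(\boldsymbol{b}, \kappa, \nu) := A(\boldsymbol{b}, \kappa, \nu) + \lambda(\boldsymbol{1}^{\mathrm T}\boldsymbol{b} - 1) = \boldsymbol{b}^{\mathrm T}\boldsymbol{C}\boldsymbol{b} + \lambda(\boldsymbol{1}^{\mathrm T}\boldsymbol{b} - 1) \tag{4.79}$$

The solution is given by: 

$$\frac{\partial A'(\boldsymbol{b}, \kappa, \nu)}{\partial \boldsymbol{b}} = 2\boldsymbol{b}^{\mathrm T}\boldsymbol{C} + \lambda\boldsymbol{1}^{\mathrm T} = \boldsymbol{0}^{\mathrm T}, \qquad \frac{\partial A'(\boldsymbol{b}, \kappa, \nu)}{\partial \lambda} = \boldsymbol{1}^{\mathrm T}\boldsymbol{b} - 1 = 0 \tag{4.80}$$

where $\boldsymbol{0}$ and $\boldsymbol{1}$ are vectors with all elements equal to zero and one, respectively. This solution can 
be written in a concise form as: 

$$\boldsymbol{b}' = \boldsymbol{C}'^{-1}\boldsymbol{d} \tag{4.81}$$

where: 

$$\boldsymbol{b}' := \begin{bmatrix}\boldsymbol{b} \\ \lambda\end{bmatrix}, \qquad \boldsymbol{C}' := \begin{bmatrix} 2\boldsymbol{C} & \boldsymbol{1} \\ \boldsymbol{1}^{\mathrm T} & 0\end{bmatrix}, \qquad \boldsymbol{d} := \begin{bmatrix}\boldsymbol{0} \\ 1\end{bmatrix} \tag{4.82}$$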


Apparently, this case is more complicated than the one studied in section 5.6 and no analytical, 
exact or approximate, relationship can be formulated. However, it is interesting to see a numerical 
illustration, such as that depicted in Figure 4.6. 

This figure allows us to make the following observations: 

• When there is no persistence (H = 0.5) and almost no dependence (α = 0.1), the weights are 
almost equal, $b_i \approx 1/100 = 0.01$. This is equivalent to choosing the global average of the past 
for inferring the future average. 

• When α is small (α = 0.1), i.e. the behaviour is close to the standard HK process, then the 
weights form a curve with almost constant slope, with higher H corresponding to steeper 
negative slope. 

• The weight of the first term (the present) is highest. 

• The weight of the last term ($x_{-99}$) is higher than the adjacent ones. The explanation is that it 
represents the unknown (and thus neglected) terms beyond $x_{-99}$. 

• The sequences with a large time scale parameter α have a negative weight for τ = −1. This means 
that the model takes into account the latest “trend”, in addition to the latest value. 


To explain the last point, let us examine the case of the prediction based on the present value 
$x_0$ and a single past value $x_{-1}$. If we used the “global” mean for the prediction, we would have $x_1 = 
(x_0 + x_{-1})/2$. If we used just the “trend”, then we would have $x_1 = 2x_0 - x_{-1}$. If we took the mean 
of the two, we would have $x_1 = 1.25x_0 - 0.25x_{-1}$. In the last two cases the weight of the term $x_{-1}$ 
is negative. 


[Figure 4.6: plot of the weights against the time index, with curves for H = 0.5, 0.65, 0.75 and 0.99, each for α = 0.1 and α = 10.] 

Figure 4.6 Illustration of the weights of the present and past values of a time series of length ν = 100 for the 
prediction of the future mean for length κ = 10, assuming an FHK-C process with M = 0.5 and the indicated values 
of parameters H and α. Note that for α = 10 (indicating high correlation at small lags) and τ = −1, the weights 
are negative and cannot be shown in the logarithmic plot. 
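
The constrained minimization of this digression can be sketched numerically. The example below is a simplified illustration and not the computation behind Figure 4.6: it assumes a plain HK process with unit variance at scale 1, Γ(k) = k^(2H), rather than the FHK-C process, and it evaluates the cross terms E[x_(1−i) y] by direct summation of autocovariances. The small linear solver and all function names are choices made for this sketch only.

```python
def hk_autocovariance(lag: int, H: float) -> float:
    """Autocovariance c_lag of an HK process with Gamma(k) = k^(2H) (assumed)."""
    g = lambda k: abs(k) ** (2.0 * H)
    return 0.5 * (g(lag + 1) + g(lag - 1)) - g(lag)


def solve_linear(a, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def optimal_weights(kappa: int, nu: int, H: float):
    """Weights b_1..b_nu (b_1 multiplies the present value x_0) and the minimized A."""
    c = lambda lag: hk_autocovariance(lag, H)
    var_y = kappa ** (2.0 * H) / kappa**2  # E[y^2] = Gamma(kappa) / kappa^2
    # E[x_(1-i) y]: x_(1-i) lies (l - 1 + i) steps before the future value x_l
    cross = [sum(c(l - 1 + i) for l in range(1, kappa + 1)) / kappa
             for i in range(1, nu + 1)]
    C = [[var_y - cross[i] - cross[j] + c(i - j) for j in range(nu)]
         for i in range(nu)]
    # Augmented system [[2C, 1], [1^T, 0]] [b; lambda] = [0; 1]
    A = [[2.0 * C[i][j] for j in range(nu)] + [1.0] for i in range(nu)]
    A.append([1.0] * nu + [0.0])
    solution = solve_linear(A, [0.0] * nu + [1.0])
    b = solution[:nu]
    mse = sum(b[i] * C[i][j] * b[j] for i in range(nu) for j in range(nu))
    return b, mse


def local_average_mse(kappa: int, nu: int, H: float) -> float:
    """A(kappa, nu) for equal weights 1/nu under the same HK assumption."""
    p = 2.0 * H - 1.0
    return (1.0 / kappa + 1.0 / nu) * (kappa**p + nu**p - (nu + kappa) ** p)


weights, mse = optimal_weights(kappa=5, nu=10, H=0.8)
```

For H = 0.5 the process is white noise and the sketch returns equal weights 1/ν, i.e. the global average; for H > 0.5 the weighted estimate cannot have a larger mean square error than the equally weighted local average of the same length.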


Appendix 4-I: Proof of equations (4.67)-(4.68) 


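From equation (4.66) we find, in terms of the cumulative process $X_\tau$: 

$$\begin{aligned} A(\kappa, \nu) &= \mathrm{E}\left[\left(\frac{X_{\nu+\kappa} - X_\nu}{\kappa} - \frac{X_\nu}{\nu}\right)^2\right] = \mathrm{E}\left[\left(\frac{X_{\nu+\kappa}}{\kappa} - \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)X_\nu\right)^2\right] \\ &= \frac{1}{\kappa^2}\mathrm{E}\left[X_{\nu+\kappa}^2\right] + \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)^2\mathrm{E}\left[X_\nu^2\right] - \frac{2}{\kappa}\left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\mathrm{E}\left[X_{\nu+\kappa}X_\nu\right] \end{aligned} \tag{4.83}$$

On the other hand, we have: 

$$\mathrm{E}\left[X_\kappa^2\right] = \mathrm{E}\left[(X_{\nu+\kappa} - X_\nu)^2\right] = \mathrm{E}\left[X_{\nu+\kappa}^2\right] + \mathrm{E}\left[X_\nu^2\right] - 2\mathrm{E}\left[X_{\nu+\kappa}X_\nu\right] \tag{4.84}$$

Thus: 

$$\begin{aligned} A(\kappa, \nu) &= \frac{1}{\kappa^2}\mathrm{E}\left[X_{\nu+\kappa}^2\right] + \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)^2\mathrm{E}\left[X_\nu^2\right] + \frac{1}{\kappa}\left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\left(\mathrm{E}\left[X_\kappa^2\right] - \mathrm{E}\left[X_{\nu+\kappa}^2\right] - \mathrm{E}\left[X_\nu^2\right]\right) \\ &= \frac{1}{\kappa}\left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\mathrm{E}\left[X_\kappa^2\right] + \frac{1}{\nu}\left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\mathrm{E}\left[X_\nu^2\right] - \frac{1}{\kappa\nu}\mathrm{E}\left[X_{\nu+\kappa}^2\right] \end{aligned} \tag{4.85}$$

Hence: 

$$A(\kappa, \nu) = \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\left(\frac{\Gamma(\kappa)}{\kappa} + \frac{\Gamma(\nu)}{\nu} - \frac{\Gamma(\nu + \kappa)}{\nu + \kappa}\right) \tag{4.86}$$

which can be written in the form of (4.67). 

For an HK process: 

$$\frac{A(\kappa, \nu)}{\lambda\alpha^{2-2H}} = \left(\frac{1}{\kappa} + \frac{1}{\nu}\right)\left(\kappa^{2H-1} + \nu^{2H-1} - (\nu + \kappa)^{2H-1}\right) \tag{4.87}$$

Assuming that κ and H are specified, we minimize the quantity: 

$$\frac{A(\kappa, \nu)\,\kappa^{2-2H}}{\lambda\alpha^{2-2H}} = \left(1 + \frac{\kappa}{\nu}\right)\left(1 + \left(\frac{\nu}{\kappa}\right)^{2H-1} - \left(\frac{\nu}{\kappa} + 1\right)^{2H-1}\right) \tag{4.88}$$

which is a function of ν/κ. For approximation we assume that ν/κ is a real number and we take 
the derivative with respect to it, which we equate to zero to find the value that minimizes A. This 
can be solved only numerically. By performing a systematic numerical investigation (finding 
optimal values of ν/κ for different H) we are able to fit equation (4.68) on the results with a small 
error. 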
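
The numerical step can be illustrated by a simple grid search. The sketch below assumes that the quantity to be minimized for an HK process is g(r) = (1 + 1/r)(1 + r^(2H−1) − (1 + r)^(2H−1)) with r = ν/κ, which is what results from multiplying (4.86) by κ^(2−2H) when Γ(κ) is proportional to κ^(2H). The search range, the grid density and the function names are arbitrary choices of this example, and the fitted equation (4.68) is not reproduced here.

```python
def scaled_mse(ratio: float, H: float) -> float:
    """Scaled mean square error as a function of ratio = nu / kappa (HK process)."""
    p = 2.0 * H - 1.0
    return (1.0 + 1.0 / ratio) * (1.0 + ratio**p - (1.0 + ratio) ** p)


def optimal_ratio(H: float, lo: float = 0.01, hi: float = 1000.0,
                  steps: int = 20000) -> float:
    """Grid search (log-spaced) for the ratio nu / kappa that minimizes scaled_mse."""
    best_ratio, best_value = lo, scaled_mse(lo, H)
    for i in range(1, steps + 1):
        ratio = lo * (hi / lo) ** (i / steps)
        value = scaled_mse(ratio, H)
        if value < best_value:
            best_ratio, best_value = ratio, value
    return best_ratio


# Optimal nu / kappa for a few Hurst parameters
ratios = {H: optimal_ratio(H) for H in (0.5, 0.7, 0.9)}
```

For H = 0.5 the function decreases monotonically, so the search ends at the upper limit of the range (the global average is best), whereas for strong persistence the minimum is attained at a finite ν/κ of order one.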


Chapter 5. Return period 


5.1 Definitions and insights on return period 


We have already introduced the concept of the return period T in section 1.5, where we 
have seen that it is inversely proportional to the probability $P_1$ that a dangerous event A 
would occur in a time unit D. Although this relationship (equation (1.7)) is almost obvious, 
here we will approach it again in a rigorous manner and examine several variants of it. 

First, we define the concept. For a specific event A, which is a subset of a basic set Ω, we 
define the return period, T, as the mean time between consecutive occurrences of the 
event A. This is a standard term in engineering applications (in engineering hydrology in 
particular) but needs some clarification to avoid common misuses and frequent 
confusion. 

We will initially consider the discrete time version and we will later see how we can 
reformulate it in continuous time. Let B be the complementary event of A (B := Ω − A). 
We denote: 

$$P_1 := P(A), \qquad P_{11} := P(A_0, A_1) \tag{5.1}$$


We examine sequences of events B, possibly in contact with events A, and in particular 
the following sequences and their corresponding probabilities: 

$$\begin{aligned} P_B(n) &:= P(B_1, B_2, \ldots, B_n) \\ P_{B|}(n) &:= P(B_1, B_2, \ldots, B_n, A_{n+1}) \\ P_{|B}(n) &:= P(A_1, B_2, \ldots, B_n, B_{n+1}) \\ P_{|B|}(n) &:= P(A_0, B_1, B_2, \ldots, B_n, A_{n+1}) \end{aligned} \tag{5.1}$$

Since: 

$$\begin{aligned} P_{|B}(n) &= P(A_1, B_2, \ldots, B_n, B_{n+1}) = P(B_2, B_3, \ldots, B_{n+1}) - P(B_1, B_2, \ldots, B_{n+1}) \\ P_{B|}(n) &= P(B_1, B_2, \ldots, B_n, A_{n+1}) = P(B_1, B_2, \ldots, B_n) - P(B_1, B_2, \ldots, B_{n+1}) \end{aligned} \tag{5.2}$$

while, by virtue of stationarity, $P(B_2, B_3, \ldots, B_{n+1}) = P(B_1, B_2, \ldots, B_n)$, we conclude that: 

$$P_{|B}(n) = P_{B|}(n) \tag{5.3}$$

Special cases of the probabilities of the above event sequences are: 

$$\begin{aligned} &P_B(0) = 1, \quad P_B(1) = 1 - P_1, \quad P_B(2) = 1 - 2P_1 + P_{11} \\ &P_{B|}(0) = P_1, \quad P_{B|}(1) = P_1 - P_{11} \\ &P_{|B|}(0) = P_{11} \end{aligned} \tag{5.4}$$

It is easy to see that: 

$$\begin{aligned} P_{B|}(n) &= P_B(n) - P_B(n + 1) \\ P_{|B|}(n) &= P_{B|}(n) - P_{B|}(n + 1) = P_B(n) - 2P_B(n + 1) + P_B(n + 2) \end{aligned} \tag{5.5}$$

Assuming that the event A has happened at time 0, if its next occurrence is at time $\underline{n}$, we 
can easily derive that: 

$$P\{\underline{n} = n\} = P(B_1, B_2, \ldots, B_{n-1}, A_n \mid A_0) = \frac{P(A_0, B_1, B_2, \ldots, B_{n-1}, A_n)}{P(A_0)} = \frac{P_{|B|}(n - 1)}{P_1} \tag{5.6}$$

The expected value of $\underline{n}$ is: 

$$\mathrm{E}[\underline{n}] = \sum_{n=0}^{\infty} n\,P\{\underline{n} = n\} = \frac{1}{P_1}\sum_{n=0}^{\infty} n\,P_{|B|}(n - 1) \tag{5.7}$$

where using (5.5) we readily see that the sum is evaluated to 1, and thus: 

$$\frac{T}{D} = \mathrm{E}[\underline{n}] = \frac{1}{P_1} \tag{5.8}$$

This is the standard relationship between return period and probability. The proof we 
have given has not assumed anything but stationarity, so it is quite generic. In particular, 
there is no assumption of independence or any particular type of dependence (see also 
Koutsoyiannis, 2008; Volpi et al. 2015). We stress that, according to the definition, the 
return period is the mean time between consecutive occurrences of the event A. We have 
assumed for the above proof that the event A has happened at time 0 (present time). 
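
The independence of this result from the dependence structure can be checked numerically. The sketch below assumes, for illustration only, a stationary two-state Markov chain, for which P_B(0) = 1 and P_B(n) = P_B(1)(P_B(2)/P_B(1))^(n−1) for n ≥ 1; it builds the probabilities of gaps between occurrences from the second differences of P_B(n) and sums them up to a truncation length. The function names and the truncation length are choices of this example.

```python
def markov_pb_sequence(p1: float, p11: float, n_max: int):
    """P_B(0), ..., P_B(n_max) for a stationary two-state Markov chain (assumed model)."""
    pb1 = 1.0 - p1
    pb2 = 1.0 - 2.0 * p1 + p11
    sequence = [1.0, pb1]
    for _ in range(2, n_max + 1):
        sequence.append(sequence[-1] * pb2 / pb1)
    return sequence


def mean_recurrence_steps(p1: float, p11: float, n_max: int = 20000) -> float:
    """E[n]: mean number of steps between consecutive occurrences of A.

    Uses P{n = n} = (P_B(n-1) - 2 P_B(n) + P_B(n+1)) / P_1, truncated at n_max.
    """
    pb = markov_pb_sequence(p1, p11, n_max + 1)
    total = sum(n * (pb[n - 1] - 2.0 * pb[n] + pb[n + 1])
                for n in range(1, n_max + 1))
    return total / p1


# Same P_1 = 0.25 with independence (P_11 = P_1^2) and with positive dependence
independent = mean_recurrence_steps(0.25, 0.0625)
dependent = mean_recurrence_steps(0.25, 0.20)
```

In both cases the sum converges to 1/P_1 = 4, whatever the value of P_11, provided that the truncation length is large enough for P_B(n) to become negligible.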

If we do not have any information about the present time, we can derive the mean time 
to the first occurrence of the event A unconditionally, denoted as $T_1 := D\,\mathrm{E}[\underline{n}_1]$, where $\underline{n}_1$ 
is the number of time intervals of length D until the next occurrence of A. Its probability 
mass is: 

$$P\{\underline{n}_1 = n\} = P(B_1, B_2, \ldots, B_{n-1}, A_n) = P_{B|}(n - 1) = P_B(n - 1) - P_B(n) \tag{5.9}$$

Thus, its expected value is: 

$$\mathrm{E}[\underline{n}_1] = \sum_{n=0}^{\infty} n\left(P_B(n - 1) - P_B(n)\right) \tag{5.10}$$

and simplifies to: 

$$\frac{T_1}{D} = \mathrm{E}[\underline{n}_1] = \sum_{n=0}^{\infty} P_B(n) \tag{5.11}$$

In case of independence, $P_B(n) = P_B(1)^n = (1 - P_1)^n$ and thus: 

$$\frac{T_1}{D} = \mathrm{E}[\underline{n}_1] = \frac{1}{P_1} \tag{5.12}$$

or $T_1 = T$. If the succession of events A and B is modelled as a Markov chain, from 
Koutsoyiannis (2006a, equation (2)) we find: 

$$P_B(n) = P_B(1)\left(\frac{P_B(2)}{P_B(1)}\right)^{n-1} \tag{5.13}$$

and hence: 

$$\frac{T_1}{D} = \mathrm{E}[\underline{n}_1] = \frac{\left(P_B(1)\right)^3}{P_B(2)\left(P_B(1) - P_B(2)\right)} = \frac{(1 - P_1)^3}{(P_1 - P_{11})(1 - 2P_1 + P_{11})} \tag{5.14}$$

The difference from T/D, after algebraic manipulations, is: 

$$\mathrm{E}[\underline{n}_1] - \mathrm{E}[\underline{n}] = \frac{(P_{11} - P_1^2)(1 - 3P_1 + P_1^2 + P_{11})}{P_1(P_1 - P_{11})(1 - 2P_1 + P_{11})} \tag{5.15}$$

For $P_1 < 0.5$ (an obvious condition to characterize the event A as dangerous) and $P_1^2 < 
P_{11} < P_1$ (meaning positive dependence), it can be verified that $1 - 3P_1 + P_1^2 + P_{11} > 0$. 
Thus $\mathrm{E}[\underline{n}_1] \ge \mathrm{E}[\underline{n}]$ or $T_1 \ge T$. For other types of dependence, assuming $B := \{\underline{x} \le x\}$, so 
that $P(B) = F(x)$ and $P_B(n) = P\{\max(\underline{x}_1, \ldots, \underline{x}_n) \le x\}$, we can also evaluate $P_B(n)$ from 
the properties of the stochastic process $\underline{x}_\tau$ and then $\mathrm{E}[\underline{n}_1]$ from equation (5.11). But this 
can be laborious. 

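Now let us make the calculations again on the condition that the latest occurrence of A 
has been observed at time −m in the past. In this case we have the conditional return 
period $T_{|m} := D\,\mathrm{E}[\underline{n} \mid m]$, which can be determined from the conditional probability: 

$$\begin{aligned} P\{\underline{n} = n \mid m\} &= P(B_1, B_2, \ldots, B_{n-1}, A_n \mid A_{-m}, B_{-m+1}, \ldots, B_0) \\ &= \frac{P(A_{-m}, B_{-m+1}, \ldots, B_0, B_1, \ldots, B_{n-1}, A_n)}{P(A_{-m}, B_{-m+1}, \ldots, B_0)} = \frac{P_{|B|}(m + n - 1)}{P_{B|}(m)} \end{aligned} \tag{5.16}$$

The required expected value is: 

$$\mathrm{E}[\underline{n} \mid m] = \sum_{n=0}^{\infty} n\,P\{\underline{n} = n \mid m\} = \frac{1}{P_{B|}(m)}\sum_{n=0}^{\infty} n\,P_{|B|}(m + n - 1) \tag{5.17}$$

which by virtue of equation (5.5) is written as: 

$$\mathrm{E}[\underline{n} \mid m] = \frac{1}{P_B(m) - P_B(m + 1)}\sum_{n=0}^{\infty} n\left(P_B(m + n - 1) - 2P_B(m + n) + P_B(m + n + 1)\right)$$

and reduces to: 

$$\frac{T_{|m}}{D} = \mathrm{E}[\underline{n} \mid m] = \frac{P_B(m)}{P_B(m) - P_B(m + 1)} \tag{5.18}$$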
In case of independence, $P_B(m) = P_B(1)^m = (1 - P_1)^m$ and hence: 

$$\frac{T_{|m}}{D} = \mathrm{E}[\underline{n} \mid m] = \frac{(1 - P_1)^m}{(1 - P_1)^m - (1 - P_1)^{m+1}} = \frac{1}{P_1} \tag{5.19}$$

or $T_{|m} = T$, which is expected because in an independent sequence of events conditioning 
on the past does not change anything. In the case of a Markov chain, using (5.13) we 
obtain: 

$$\frac{T_{|m}}{D} = \mathrm{E}[\underline{n} \mid m] = \frac{P_B(1)}{P_B(1) - P_B(2)} = \frac{1 - P_1}{P_1 - P_{11}} \tag{5.20}$$

This is independent of m because, due to the Markov property, the past is irrelevant once 
the present is known. The difference from $\mathrm{E}[\underline{n}_1]$, after algebraic manipulations, is: 

$$\mathrm{E}[\underline{n} \mid m] - \mathrm{E}[\underline{n}_1] = \frac{(1 - P_1)(P_{11} - P_1^2)}{(P_1 - P_{11})(1 - 2P_1 + P_{11})} \tag{5.21}$$

It can be verified that when $P_1 < 0.5$, $P_1^2 < P_{11} < P_1$ (as above), the difference is positive. 
Thus, for m > 0, $\mathrm{E}[\underline{n} \mid m] \ge \mathrm{E}[\underline{n}_1] \ge \mathrm{E}[\underline{n}]$ or $T_{|m} \ge T_1 \ge T$. For m = 0, $\mathrm{E}[\underline{n} \mid 0] = \mathrm{E}[\underline{n}]$, as 
expected. However, for m = 1 the difference can be substantial, as seen in an illustration 
in Figure 5.1, and tends to infinity as $P_{11} \to P_1$. 

Figure 5.1 Illustration of the different variants of return period for probability of a dangerous 
event A (left) $P_1$ = 0.25 and (right) $P_1$ = 0.01 as a function of the probability of two consecutive 
dangerous events $P_{11}$. For $T_1$ and $T_{|m}$, a Markov chain model is assumed. 

Let us now take a further, quite important, step. We first assume that the basic time 
step D, which we used to define the dangerous event A, for example a mean river discharge 
of 1000 m³/s, is a day. Up to now, we have tacitly assumed that if the event A occurs on 
two consecutive days, this constitutes two occurrences of A. But what if we radically 
reduce the time unit to, say, D = 1 min? The theory should also apply in this case. Should 
the occurrence of mean discharge of 1000 m³/s for two consecutive minutes be regarded 
as two occurrences of a dangerous event? And what if our time step is D = 1 s or 1 μs? 
A reasonable answer would be that the continuation of a dangerous event for a 
number of consecutive steps should not be regarded as an occurrence of a new event. 

This requires some modified analysis. According to this consideration, in the sequence 
$B_1, B_2, B_3, A_4, B_5, B_6, A_7, A_8, A_9, B_{10}$ we have two occurrences of A ($A_4$ and $A_7$-$A_8$-$A_9$) rather 
than four. Assuming a long period of length L = lD, in which we have n occurrences of A, 
we expect that $n = P_1 l$. A number of them will be continuations of the previous 
occurrences, namely, $n_c = P_{11} l$. Thus, we may write: 

$$\tilde{T} = \frac{L}{n - n_c} = \frac{D}{P_1 - P_{11}} \tag{5.22}$$

We can prove that this result is accurate using formal probability theory. To this aim, in 
a sequence of events A and B we replace all but the first A of each run with B, thus forming a sequence 
of modified events $\tilde{A}$ and $\tilde{B}$. For example, the sequence given in the previous paragraph 
becomes $\tilde{B}_1, \tilde{B}_2, \tilde{B}_3, \tilde{A}_4, \tilde{B}_5, \tilde{B}_6, \tilde{A}_7, \tilde{B}_8, \tilde{B}_9, \tilde{B}_{10}$. Now if we define $\tilde{P}_1 := P(\tilde{A})$, $\tilde{P}_{11} := 
P(\tilde{A}_0, \tilde{A}_1)$, we readily find that their values are: 

$$\tilde{P}_1 = P(\tilde{A}) = P_1 - P_{11}, \qquad \tilde{P}_{11} = P(\tilde{A}_0, \tilde{A}_1) = 0 \tag{5.23}$$

We can apply all previous results replacing $P_1$ and $P_{11}$ with $\tilde{P}_1$ and $\tilde{P}_{11}$ and find the 
respective quantities for the sequence of modified events $\tilde{A}$ and $\tilde{B}$. In particular we find 
that the return period is: 

$$\frac{\tilde{T}}{D} = \mathrm{E}[\underline{\tilde{n}}] = \frac{1}{\tilde{P}_1} = \frac{1}{P_1 - P_{11}} \tag{5.24}$$

Comparing with the previous results we infer that for independent processes and Markov 
chains, $\tilde{T} \ge T_{|m} \ge T_1 \ge T$. In particular, the inequality $\tilde{T} \ge T$ is obviously valid for any 
process. 
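
The counting argument for the distinct return period lends itself to a simple simulation check. The following sketch assumes a stationary two-state Markov chain specified by P_1 and P_11, simulates a long sequence, and compares the empirical mean spacing of all occurrences, and of distinct occurrences only, with D/P_1 and D/(P_1 − P_11). The function names, the seed and the sequence length are arbitrary choices of this example.

```python
import random


def simulate_chain(p1: float, p11: float, steps: int, seed: int = 1):
    """Simulate occurrences of A (True) in a stationary two-state Markov chain."""
    rng = random.Random(seed)
    p_a_given_a = p11 / p1                   # P(A_1 | A_0)
    p_a_given_b = (p1 - p11) / (1.0 - p1)    # P(A_1 | B_0)
    state = rng.random() < p1                # start from the stationary distribution
    sequence = []
    for _ in range(steps):
        sequence.append(state)
        state = rng.random() < (p_a_given_a if state else p_a_given_b)
    return sequence


def empirical_return_periods(sequence, D: float = 1.0):
    """Return (L / n, L / (n - n_c)): standard and distinct return period estimates."""
    L = len(sequence) * D
    n = sum(sequence)  # all occurrences of A
    n_c = sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev and cur)
    return L / n, L / (n - n_c)


seq = simulate_chain(p1=0.25, p11=0.10, steps=200_000)
T_hat, T_distinct_hat = empirical_return_periods(seq)
```

With P_1 = 0.25 and P_11 = 0.10 the two estimates should be close to 1/0.25 = 4 and 1/0.15 ≈ 6.67 time steps, respectively, up to sampling error.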

The sequence of modified events $\tilde{A}$ and $\tilde{B}$ cannot be independent, because the 
occurrence of $\tilde{A}_\tau$ excludes the occurrence of $\tilde{A}_{\tau+1}$; this signifies negative dependence. 
However, this sequence can be a Markov chain but now, because of the negative 
dependence, the conditional and unconditional return periods will be smaller than $\tilde{T}$. 
For example, it can be shown (homework) that for m > 0, $\tilde{T}_{|m} = \tilde{T} - D$. This allows us to 
conjecture that $\tilde{T}$ represents an upper bound of all variants of return period, conditional 
and unconditional. If the sequence of events A and B is a Markov chain, then that of 
$\tilde{A}$ and $\tilde{B}$ is not and therefore the conditional $\tilde{T}_{|m}$ should depend on m. Its analytical 
determination can be laborious. Nonetheless, stochastic simulation can readily provide 
its behaviour. A couple of examples are depicted in Figure 5.2, where it can be seen that 
T, = T, while T,,, tends to T,,, as m increases. 

Extreme events that are of interest in geophysics (and, in particular, hydroclimatic 
processes) are usually of two types, highs (e.g. storms and floods) or lows (e.g. droughts). 
In the former case the dangerous event is the exceedance of a certain threshold value x, 
usually related to a failure of a system, operational or structural. In this case the 
dangerous and non-dangerous events are defined as $A_\tau := \{\underline{x}_\tau > x\}$, $B_\tau := \{\underline{x}_\tau \le x\}$, 
respectively, where $\underline{x}_\tau$ is a stochastic process quantifying a natural process (e.g. river 
discharge). 

It is interesting to observe that, by virtue of (5.2) (or (5.3)): 

$$P(A_1, B_2) = P(B_1, A_2) \tag{5.25}$$

and since 

$$P(A_1, B_2) = P(A_1) - P(A_1, A_2), \qquad P(B_1, A_2) = P(B_1) - P(B_1, B_2) \tag{5.26}$$

we obtain 

$$P(A_1) - P(A_1, A_2) = P(B_1) - P(B_1, B_2) \tag{5.27}$$

Figure 5.2 Simulation results for conditional return periods $T_{|m}$ and $\tilde{T}_{|m}$ for two Markov chains 
with the indicated characteristics. 


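We can use the last property to express the return periods in terms of the distribution 
function of the process $\underline{x}_\tau$. In the case that extremes are maxima, we have $P_1 = P(A_\tau) = 
P\{\underline{x}_\tau > x\} = \bar{F}(x)$ and $P(B_\tau) = P\{\underline{x}_\tau \le x\} = F(x)$. Likewise, $P_{11} = P(A_1, A_2) = P\{\underline{x}_1 > 
x, \underline{x}_2 > x\} = \bar{F}_z(x)$, where $\underline{z} := \min(\underline{x}_1, \underline{x}_2)$, and $P(B_1, B_2) = P\{\underline{x}_1 \le x, \underline{x}_2 \le x\} = F_y(x)$, 
where $\underline{y} := \max(\underline{x}_1, \underline{x}_2)$. Hence: 

$$\frac{T}{D} = \frac{1}{P_1} = \frac{1}{\bar{F}(x)} = \frac{1}{1 - F(x)}, \qquad \frac{\tilde{T}}{D} = \frac{1}{P_1 - P_{11}} = \frac{1}{\bar{F}(x) - \bar{F}_z(x)} = \frac{1}{F(x) - F_y(x)} \tag{5.28}$$

where we have taken advantage of the symmetry relationship (5.27). 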

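In the case that extremes are minima, working as above but interchanging the 
definitions of $A_\tau$ and $B_\tau$ and denoting the return period as $\underline{T}$ to distinguish it from T, we 
find: 

$$\frac{\underline{T}}{D} = \frac{1}{P_1} = \frac{1}{F(x)}, \qquad \frac{\tilde{T}}{D} = \frac{1}{P_1 - P_{11}} = \frac{1}{F(x) - F_y(x)} \tag{5.29}$$

The identical expression of $\tilde{T}$ for maxima and minima is remarkable. 

In the case of an independent process, $F_y(x) = (F(x))^2$ and thus: 

$$\frac{\tilde{T}}{D} = \frac{1}{F(x)(1 - F(x))} = \frac{1}{F(x)\bar{F}(x)} \tag{5.30}$$

both for maxima and minima. Furthermore, it can be readily seen that 

$$\frac{\tilde{T}}{D} = \frac{T}{D} + \frac{\underline{T}}{D} = \frac{T}{D}\,\frac{\underline{T}}{D} \tag{5.31}$$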


Interestingly, the first term equals both the sum and the product of the two other terms. 
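
A short numerical illustration of this property for an independent process follows. It assumes only the relationships stated above, i.e. that the return period is D/(1 − F) for maxima, D/F for minima, and D/(F(1 − F)) for the distinct variant; the function name and the example value F = 0.8 are arbitrary.

```python
import math


def return_periods_independent(F: float, D: float = 1.0):
    """Return (T for maxima, T for minima, distinct T) for an independent process.

    F is the non-exceedance probability F(x) of the threshold, with 0 < F < 1.
    """
    F_bar = 1.0 - F                 # exceedance probability
    T_maxima = D / F_bar
    T_minima = D / F
    T_distinct = D / (F * F_bar)
    return T_maxima, T_minima, T_distinct


T_max, T_min, T_dist = return_periods_independent(F=0.8, D=1.0)
# In units of D, the distinct return period equals both the sum and the product
sum_matches = math.isclose(T_dist, T_max + T_min)
product_matches = math.isclose(T_dist, T_max * T_min)
```

For F = 0.8 and D = 1 the three values are 5, 1.25 and 6.25, and indeed 5 + 1.25 = 5 × 1.25 = 6.25.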

5.2 Most useful variants of return period 


Recapitulating the results of section 5.1, we distinguish the two most important concepts 
that will be used further: the return period $T$ (for maxima) or $\underline{T}$ (for minima), and the 
distinct return period $\tilde{T}$. Their properties, advantages and disadvantages are summarized 
in Table 5.1. 


Table 5.1 Properties of the two most important variants of return period. 

| Property | Return period $T$ or $\underline{T}$ | Distinct return period $\tilde{T}$ |
|---|---|---|
| Definition | Mean time between consecutive exceedances (for maxima) or non-exceedances (for minima) of a threshold value x. | Mean time between consecutive distinct exceedances or non-exceedances of a threshold value x (after an interruption by an opposite event). |
| Equation to derive | For maxima: $\frac{T}{D} = \frac{1}{\bar{F}(x)} = \frac{1}{1 - F(x)}$. For minima: $\frac{\underline{T}}{D} = \frac{1}{F(x)}$ | $\frac{\tilde{T}}{D} = \frac{1}{\bar{F}(x) - \bar{F}_z(x)} = \frac{1}{F(x) - F_y(x)}$ |
| Requirements to derive | Marginal distribution, $F(x)$. | Order-two distribution $F(x_1, x_2)$ or marginal distributions of $\underline{x}$ and the maximum of two consecutive $\underline{x}$. |
| Relationship with $F(x)$ | One-to-one correspondence with $F(x)$ but different for maxima and minima. | Symmetry with respect to maxima and minima; one-to-two correspondence with $F(x)$. |
| Discrete vs. continuous time | Can only work in discrete time. | Better behaviour for multiscale description, offering extensibility to continuous time. |
| Estimation from time series | Easy (and potentially unbiased) estimation from time series. | (Not yet explored) |


In addition, we introduce the concept of the excess return period defined as the 
difference of the return period from the time unit, i.e., the quantities: 

$$T - D, \qquad \underline{T} - D \tag{5.32}$$

These are particularly useful in probability plots (see Digression 5.A) as both range in 
(0, ∞) (while both $T$ and $\underline{T}$ take values > D). These quantities have the following properties 
of symmetry: 

$$\frac{T - D}{D} = \frac{F(x)}{\bar{F}(x)}, \qquad \frac{\underline{T} - D}{D} = \frac{\bar{F}(x)}{F(x)}, \qquad (T - D)(\underline{T} - D) = D^2 \tag{5.33}$$


Digression 5.A: Visualizing probability through return periods 


It has been a common practice in geophysics and even more so in engineering applications to use 
the return period T, instead of the distribution function F, for probability plots. Representative 
examples are given in the left column of Figure 5.3 for several distribution functions. In both 
panels the return period is on the horizontal axis, which is logarithmic. The distribution quantile 
is shown on the vertical axis on a linear axis (upper panel) or on a logarithmic axis (lower panel). 
The former option is good for light-tailed distributions such as normal and exponential. 
Heavy-tailed distributions are better depicted on the double logarithmic plot, on which the slope on the 
right equals the (upper-)tail index. If we also wish to visualize the lower-tail index, we should 
replace the return period, T, with the excess return period, T − D. In this case, as shown in the 
right column of Figure 5.3, the right part of the distribution is not affected, but the left is 
substantially changed, so that the slope on the left on the log-log plot (lower right panel) equals 
1/ζ, where ζ is the lower-tail index. The values of the asymptotic slopes, left and right, for the 
excess return period plots and for several common distributions are shown in Table 5.2. 

Figure 5.3 Variants of probability plots using the return period (left column) or the excess return period 
(right column) instead of the distribution function. 


Table 5.2 Asymptotic slopes of the plots of distribution quantiles vs. excess return periods for several 
common distributions (for the definition of the distributions and their parameters see Table 2.5). 

| Distribution | Left slope, $\lim_{T \to D} \frac{\mathrm{d}x}{\mathrm{d}\ln(T - D)}$ | Right slope, $\lim_{T \to \infty} \frac{\mathrm{d}x}{\mathrm{d}\ln(T - D)}$ | Left slope, log-log, $\lim_{T \to D} \frac{\mathrm{d}\ln x}{\mathrm{d}\ln(T - D)}$ | Right slope, log-log, $\lim_{T \to \infty} \frac{\mathrm{d}\ln x}{\mathrm{d}\ln(T - D)}$ |
|---|---|---|---|---|
| Exponential | 0 | λ | 1 | 0 |
| Normal | 0 | 0 | 0 | 0 |
| Lognormal | 0 | ∞ | 0 | 0 |
| Gamma | 0 | λ | 1/ζ | 0 |
| Weibull | 0 | ∞ | 1/ζ | 0 |
| Pareto | 0 | ∞ | 1 | ξ |
| PBF | 0 | ∞ | 1/ζ | ξ |

5.3 Reliability and probability of failure 


Amongst the probabilities of sequences of events that have been introduced and 
discussed, most important is the probability of zero dangerous events in a period of n time 
steps, termed reliability: 

$$P_B(n) = P(B_1, B_2, \ldots, B_n) = P\{\underline{x}_1 \le x, \ldots, \underline{x}_n \le x\} = F_{y_n}(x) \tag{5.34}$$

where $\underline{y}_n := \max(\underline{x}_1, \ldots, \underline{x}_n)$. Its complement from one, i.e. 

$$P_F(n) = 1 - P_B(n) = \bar{F}_{y_n}(x) \tag{5.35}$$

is called probability of failure and is equally important. This is also known in the literature 
as risk (e.g. Chow et al., 1988) or risk of failure (Serinaldi, 2015). However, here we 
preferred the more accurate term probability of failure instead of risk, because the latter 
term has acquired a broader meaning, incorporating, in addition to probability, exposure 
and vulnerability (Kron et al. 2019; see also Chapter 11). 

The probability of failure is a concept best suited for design studies and risk 
assessments. However, it is not easy to handle as it needs the derivation of the any-order 
distribution of a stochastic process or at least the marginal distributions of maxima of any 
order—and, as we have seen in Digression 2.J, the convergence of this distribution to the 
related asymptote is too slow. For that reason, the design and risk assessment studies, as 
well as the legislation related to management of extreme events (e.g. the European Flood 
Directive; European Commission, 2007) are more commonly based on the concept of 
return period whose evaluation only needs the marginal distribution (in the standard 
variant) or the second-order distribution at most (in the variant of the distinct return 
period). Common values adopted in engineering design (depending on the importance of 
the structure and the consequences in case of failure) are shown in Table 5.3. 

Table 5.3 Return periods (T) most commonly used in engineering design for high flows and 
corresponding exceedance probability (F̄, equal to the probability of occurrence of a dangerous 
event P₁), and non-exceedance probability (F). The time unit is assumed D = 1 year. 

T (years)   F̄ = P₁    F         T (years)   F̄ = P₁     F 
2           0.50      0.50      500         0.002      0.998 
5           0.20      0.80      1000        0.001      0.999 
10          0.10      0.90      5 000       0.0002     0.9998 
20          0.05      0.95      10 000      0.0001     0.9999 
50          0.02      0.98      50 000      0.00002    0.99998 
100         0.01      0.99      100 000     0.00001    0.99999 

Note: If the dangerous event is a low one (e.g., low flow, low temperature), we must interchange the columns 
F̄ and F of the table. 


5.4 Relationship of probability of failure and return period 


It is easy to see that in case that the process of interest is independent in time, the 
following relationship holds true: 

$$1 - P_F(n) = P_B(n) = (1 - P_1)^n \tag{5.36}$$

Thus, if we specify a length n (e.g., nD is the design life span of a project) and the 
probability of failure $P_F(n)$, then the design return period is 

$$\frac{T}{D} = \frac{1}{P_1} = \frac{1}{1 - \left(1 - P_F(n)\right)^{1/n}} \tag{5.37}$$


This is a standard relationship used in hydrological design. 
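
As a worked illustration of this relationship for an independent process, the sketch below converts between return period and probability of failure, assuming P_F(n) = 1 − (1 − D/T)^n and its inverse T = D/(1 − (1 − P_F)^(1/n)). The function names and the numerical values are chosen for the example only.

```python
def failure_probability(T: float, n: int, D: float = 1.0) -> float:
    """Probability of at least one dangerous event in n time steps (independence)."""
    return 1.0 - (1.0 - D / T) ** n


def design_return_period(p_fail: float, n: int, D: float = 1.0) -> float:
    """Return period needed so that the probability of failure in n steps is p_fail."""
    return D / (1.0 - (1.0 - p_fail) ** (1.0 / n))


# A structure designed for T = 100 years and operated for n = 100 years
risk_100 = failure_probability(T=100.0, n=100)

# Return period required for a 1% probability of failure over 100 years
T_required = design_return_period(p_fail=0.01, n=100)
```

The first result is about 0.63, i.e. a 100-year design is more likely than not to fail within a 100-year life span; the second is roughly 9950 years, which shows why very large return periods appear in the design of important structures.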
In a process with dependence in time we can modify this relationship by introducing 
the notion of equivalent length n′: 

$$\frac{T}{D} = \frac{1}{P_1} = \frac{1}{1 - \left(1 - P_F(n)\right)^{1/n'}} \tag{5.38}$$

where obviously n′ = n for the case of independence. Solved for the probability of failure, 
this yields: 

$$P_F(n) = 1 - (1 - P_1)^{n'} = 1 - \left(1 - \frac{D}{T}\right)^{n'} \tag{5.39}$$

For a Markov chain we have 

$$P_B(n) = P_B(1)\left(\frac{P_B(2)}{P_B(1)}\right)^{n-1} = \left(P_B(1)\right)^{n'} \tag{5.40}$$

so that 

$$n' = 1 + \left(\frac{1}{\zeta} - 1\right)(n - 1), \qquad \zeta := \frac{\ln\left(P_B(1)\right)}{\ln\left(P_B(2)\right)} \tag{5.41}$$

If ζ = 1/2, then we recover n′ = n, the case of independence. 

Koutsoyiannis (2006a), based on maximum entropy considerations, introduced the 
quasi-Markov (but in essence non-Markov) structure in which: 

$$1 - P_F(n) = P_B(n) = (1 - P_1)^{\left(1 + (\zeta^{-1/\theta} - 1)(n - 1)\right)^{\theta}} \tag{5.42}$$

In this case: 

$$n' = \left(1 + (\zeta^{-1/\theta} - 1)(n - 1)\right)^{\theta} \tag{5.43}$$

This is a two-parameter relationship with each of the parameters ζ, θ ranging in (0,1). It 
can readily be seen that the Markov chain is a special case attained when θ = 1. Iliopoulou 
and Koutsoyiannis (2019), based on a systematic simulation study, have shown that it can 
effectively describe an HK process $\underline{x}_\tau$ for several marginal distributions F(x) and 
thresholds x. 

For processes with positive dependence we will have n’ < n and the resulting return 
period will be smaller than in the case of independence (see Figure 5.4). Thus, if we design 
by specifying the probability of failure and we neglect the existing dependence of the 
process, our design will be on the safe side in terms of the resulting return period. 
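
The effect of dependence on the equivalent length can be illustrated numerically. The sketch below assumes the relationships n′ = (1 + (ζ^(−1/θ) − 1)(n − 1))^θ and ζ = ln P_B(1)/ln P_B(2), with θ = 1 giving the Markov chain and ζ = 1/2 the independent case; the function names and parameter values are hypothetical and serve only as an example.

```python
import math


def zeta_from_probabilities(p1: float, p11: float) -> float:
    """zeta = ln P_B(1) / ln P_B(2), with P_B(1) = 1 - P_1 and P_B(2) = 1 - 2 P_1 + P_11."""
    return math.log(1.0 - p1) / math.log(1.0 - 2.0 * p1 + p11)


def equivalent_length(n: int, zeta: float, theta: float = 1.0) -> float:
    """Equivalent length n' (theta = 1: Markov chain; zeta = 0.5: independence)."""
    return (1.0 + (zeta ** (-1.0 / theta) - 1.0) * (n - 1)) ** theta


def failure_probability_dependent(p1: float, n: int, zeta: float,
                                  theta: float = 1.0) -> float:
    """P_F(n) = 1 - (1 - P_1)^n' for the assumed dependence structure."""
    return 1.0 - (1.0 - p1) ** equivalent_length(n, zeta, theta)


p1, n = 0.01, 100
zeta_dep = zeta_from_probabilities(p1, p11=0.005)    # positive dependence
n_eq = equivalent_length(n, zeta_dep)                # smaller than n
pf_independent = failure_probability_dependent(p1, n, zeta=0.5)
pf_dependent = failure_probability_dependent(p1, n, zeta_dep)
```

With positive dependence (P_11 > P_1²) the sketch gives ζ > 1/2 and n′ < n, so that for the same P_1 the probability of failure over the n steps is lower than under independence.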

Figure 5.4 Probability of failure as a function of the probability of occurrence of a dangerous event 
$P_1$ or the return period T for a period nD where n = 100, and for different stochastic structures of 
the underlying process. 


Digression 5.B: Avoiding misuses of return period 


We have insisted in our discussion of the return period on the fact that it is a dimensional quantity 
with units of time. The notation should be consistent with this fact. It is very common to express 
the return period in years; it is also common to analyse annual time series, i.e. with D = 1 year. 
Sometimes dimensionality is forgotten, essentially identifying T with E[n], or treating the return 
period as the reciprocal of the exceedance probability. However, this is not dimensionally 
consistent and not general enough (it does not cover the case of low flows). 

Perhaps the word period in the term return period is not quite proper as it may mislead people 
to imply that there is some periodic behaviour in consecutive occurrences of events such as in 
exceedance or non-exceedances of threshold values in nature. In a stochastic process the time 
between consecutive occurrences of the event is a stochastic variable whose mean is the return 
period, T. For example, if the value 500 m³/s of the annual maximum discharge in a river has a 
return period of 50 years, this does not mean that this value would be exceeded periodically once 
every 50 years. Rather it means that the average time between consecutive exceedances will be 
50 years. An alternative term that has been used to avoid “period” is recurrence interval. 
However, sometimes (e.g. in Chow et al., 1988) this term has been given the meaning of the 
stochastic variable nD and not its mean T. Also, the notion of the return period should not be 
thought of as a time period of the real world. For example, as seen in Table 5.3, engineers use 
return periods that in major constructions can be > 10 000 years. One should not compare such 
return periods with real durations, e.g. with the duration of the Holocene. 

Nonetheless, the term return period by now has more than 100 years of history (Volpi et al., 
2015) and we have kept it also in this text, despite the above caveats. 

Digression 5.C: Approximation for the extremes of the normal distribution 


The normal distribution is a key model with a wide spectrum of applications. Its usefulness stems 
on the one hand from theoretical reasons (Central Limit Theory, Principle of Maximum Entropy) 
and on the other hand from the simplicity of its handling, particularly when dealing with multiple 
dependent variables. Also, the fact that time averaging of a stochastic process with normal 
distribution preserves the normal distribution, makes the process easiest to handle in discrete 
time. However, its behaviour that is related to extremes is not easy to handle in terms of precise 
analytical equations. Therefore, here we derive several approximations, useful for application. 

First, in Appendix 5-I we derive the following approximation for Fy (x), the standard normal 
distribution function, N(0,1): 


Fy(x) = (5.44) 


eae) 20 
Sie Be 3% xs 


By inverting it, the quantile function is: 


=( [1 — 4In(2(1 - F)) - 1), Ped/2 
4 (5.45) 


-3 (fi 4n@F - 1), x<1/2 


As seen in Figure 5.5, the approximation is close to accurate. However, it is noted that the 
approximation is not consistent with the limiting behaviour on the distribution tail. A better 
approximation for the tail, in particular for |x| = 6, is given by a well-known approximation: 
Fy (x) * x2 (5.46) 
_ £40) (| —— J], x<-6 
V2Tt x ( 2 
This is consistent with the distribution tail, which means that: 

Figure 5.5 Comparison of the approximation of the normal distribution by equation (5.44) with the exact 
normal distribution. 

Furthermore, in Appendix 5-II we find an approximation of the distribution function of the 
maximum of two correlated variables with standard normal distribution and correlation 
coefficient r. Denoting the maximum as $\underline{y} := \max(\underline{x}_1, \underline{x}_2)$, its distribution function is: 


m?*\ Fy (—sly| — m) (5.48) 
> a a 


ne 1-r Bgl eer ee) 
ier ee ey, 


Illustration of the achieved approximation is given in Figure 5.6. 

Figure 5.6 Comparison of the approximate and the exact distribution function of the maximum of two 
correlated variables with standard normal distribution and correlation coefficient r. The approximate 
distribution function is given by equation (5.48) and the exact is calculated by numerical integration of its 
density given in Appendix 5-II. The plots are of the ratios $F_y(y)/F_N(y)^2$, where $F_N(y)^2$ is the distribution 
function of the maximum of two uncorrelated variables. 


5.5 Return period and time scale 


The analyses in section 5.1 have been based on a specified discrete time step, D, while 
reliability and probability of failure are defined in terms of a second characteristic time, 
the project life span L = nD. The study of extremes, thus, involves two operations on an 
instantaneous stochastic process x(t): taking the time average (for discretization on a 
time step D) and taking the maximum (over the period L). If we change the time step D 
the results may change and this signifies a problem as the choice of the time step is 
subjective, determined from the available time resolution of measurements or from 
modelling conventions. In order to make our analyses more objective we need proper 
transformations in descriptions among different time scales and also an analysis of the 
behaviour as time scale tends to zero. We note that the study of extremes in continuous 
time is a separate scientific field, the level crossing theory (e.g. Brill, 2017), but this may 
not be well suited for hydroclimatic extremes, which are typically studied on time-averaged 
processes. 




Replacing the fixed time constant D in equation (5.28) with a varying time scale k, we 
obtain the following expressions for the return period $T^{(k)}$ and the distinct return period 
$\bar{T}^{(k)}$ at time scale k: 

$$\frac{T^{(k)}}{k} = \frac{1}{1 - F^{(k)}(x)}, \qquad \frac{\bar{T}^{(k)}}{k} = \frac{1}{F^{(k)}(x) - F_y^{(k)}(x)} \tag{5.50}$$

where $F^{(k)}(x) := P\{x_\tau^{(k)} \le x\}$, $F_y^{(k)}(x) := P\{y_\tau^{(k)} \le x\}$, $y_\tau^{(k)} := \max(x_\tau^{(k)}, x_{\tau+1}^{(k)})$ and $x_\tau^{(k)}$ is 
the time-averaged process at time scale k. Now if we assume that the instantaneous 
process x(t) has finite variance and is fully smooth (differentiable), it is reasonable to 
assume that a certain threshold x that is representative of a dangerous event at a certain 
small time scale $k_0$ will also be representative for any smaller time scale $k \le k_0$. As $k \to 0$, 
the distribution function $F^{(k)}(x)$ will tend to that of the instantaneous process, F(x). 
Thus, according to equation (5.50), the return period of the fixed threshold value x will 
vary in proportion to the time scale k and will be precisely zero for the instantaneous 
process ($k \to 0$), irrespective of the threshold value x. This looks paradoxical but it is a 
precise result, which suggests that the return period $T^{(k)}$ is not a proper index to move 
across time scales and to study the instantaneous process. On the other hand, if it happens 
that $F_y^{(k)}(x) = F^{(k)}(x) - kC$, for some constant C, then it is possible to have a constant 
distinct return period $\bar{T}^{(k)} = 1/C$ for a range of time scales, including the instantaneous one. 

Before we continue on the latter issue, it is useful to further discuss the smoothness of 
a process. In section 3.8 we have introduced the smoothness (or roughness or fractality) 
parameter M, 0 < M ≤ 1, and we noted that the value M = 1/2 signifies neutrality while 
lower values denote a rough process and higher values denote a smooth process, with the 
value M = 1 corresponding to full smoothness. In the latter case the process is 
(mean-square) differentiable, a property meaning that the first and second derivatives of the 
autocovariance function exist (Papoulis, 1991, p. 337, 606); in particular the first 
derivative at zero should be c′(0) = 0 (because c(h) is even). 

We will provide some illustrations using the FHK-C process whose climacogram is 
given in (3.87). It can be verified that for M = 1 the climacogram and the autocovariance 
function are: 

$$\gamma(k) = \frac{\lambda}{\left(1 + (k/\alpha)^2\right)^{1-H}}, \qquad c(h) = \lambda\,\frac{1 + (5H - 3)(h/\alpha)^2 + H(2H - 1)(h/\alpha)^4}{\left(1 + (h/\alpha)^2\right)^{3-H}}$$

Further, it can be verified that the first two derivatives exist, with c′(0) = 0 and 
c″(0) = −12λ(1 − H)/α². We note that for M < 1 these derivatives do not exist at h = 0. Figure 5.7 
depicts traces of the FHK-C process for three cases, the fully smooth, the neutral and a 
rough one. The meaning of the smoothness can better be realized by comparing the three 
traces. 
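
The closed-form autocovariance can be checked numerically. The sketch below assumes the standard relationship between climacogram and autocovariance, c(h) = (1/2) d²[h² γ(h)]/dh², which is not restated in this section, and evaluates it by finite differences; function names and default parameter values (λ = 1, α = 10, H = 0.8) are illustrative.

```python
import math

def climacogram(k, lam=1.0, alpha=10.0, H=0.8):
    """FHK-C climacogram for the fully smooth case M = 1."""
    return lam / (1.0 + (k / alpha) ** 2) ** (1.0 - H)

def autocovariance(h, lam=1.0, alpha=10.0, H=0.8):
    """Closed-form autocovariance for M = 1."""
    u2 = (h / alpha) ** 2
    num = 1.0 + (5.0 * H - 3.0) * u2 + H * (2.0 * H - 1.0) * u2 * u2
    return lam * num / (1.0 + u2) ** (3.0 - H)

def autocovariance_numeric(h, d=1e-3, **kw):
    """c(h) = (1/2) d^2[h^2 gamma(h)]/dh^2 by a central finite difference
    (assumed climacogram-autocovariance relationship)."""
    g = lambda t: 0.5 * t * t * climacogram(t, **kw)
    return (g(h + d) - 2.0 * g(h) + g(h - d)) / (d * d)

def second_derivative_at_zero(d=1e-2, **kw):
    """Finite-difference estimate of c''(0), to compare with -12 lam (1 - H) / alpha^2."""
    return (autocovariance(d, **kw) - 2.0 * autocovariance(0.0, **kw)
            + autocovariance(-d, **kw)) / (d * d)

for h in (0.0, 1.0, 5.0, 20.0):
    print(h, autocovariance(h), autocovariance_numeric(h))
print(second_derivative_at_zero())
```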
Figure 5.7 Illustration of traces from the FHK-C process for the indicated values of M and for H = 
0.8, α = 10 and λ = 1; the same white noise sequence with normal distribution was used for all 
three cases and was transformed to FHK-C series by SMA filtering (see Chapter 7 for the details 
of the latter). 


Having clarified the meaning of smoothness, we proceed to illustrate the relationships 
of the threshold x, the return period $T^{(k)}$ and the distinct return period $\bar{T}^{(k)}$ as time scale 
k varies. For this illustration we use the normal distribution with zero mean, as it is the 
easiest to handle theoretically, given that it is preserved at any time scale of averaging. 
We approach the distribution of the maximum of two consecutive variables, 
$y_\tau^{(k)} := \max(x_\tau^{(k)}, x_{\tau+1}^{(k)})$, by the approximation given in Digression 5.C and, in particular, in 
equation (5.48). We note that $T^{(k)}$ has a lower mathematical limit, corresponding to an 
exceedance probability $1 - F^{(k)}(x) = 1$, which is 

$$T_{\min}^{(k)} = k \tag{5.52}$$

However, under the condition $1 - F^{(k)}(x) \le 1/2$ to characterize the event A as dangerous, the 
lower limits of both $T^{(k)}$ and $\bar{T}^{(k)}$ correspond to $1 - F^{(k)}(x) = 1/2$ and are: 

$$T_{\min}^{(k)} = 2k, \qquad \bar{T}_{\min}^{(k)} = \frac{k}{1/2 - F_y^{(k)}(x_{1/2})} \tag{5.53}$$

where $x_{1/2}$ is the median. 

We set forward the hypothesis that in the transformations among scales, what should 
remain constant is the distinct return period $\bar{T}^{(k)}$. Starting with a fully smooth process, 
for which, as explained, it is reasonable to fix the threshold value for small scales, we 
observe in the panels of the upper row in Figure 5.8 that our hypothesis works: a constant 
$\bar{T}^{(k)}$ indeed yields a constant threshold x. However, if the process is not fully smooth, a 
constant x for a changing scale is not a desideratum. Indeed, in a rough process, if we 
increase the time scale, the averaged process is less varying and thus the threshold value 
should be chosen smaller. Actually, this is the behaviour shown in the middle and lower 
rows in Figure 5.8, and this confirms the reasonableness of the hypothesis. Figure 5.9 shows 
the effect of the time scale parameter α on the variation with time scale k of the threshold 
x, the return period $T^{(k)}$ and the distinct return period $\bar{T}^{(k)}$. Here the model parameters 
except α were kept constant and the process studied is neutral in terms of smoothness (M 
= 0.5). The behaviour looks similar to that in Figure 5.8. 
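
A rough numerical illustration of how the two variants of return period behave across scales is sketched below. It rests on several assumptions that are not stated in this section: the general FHK-C climacogram is taken as γ(k) = λ(1 + (k/α)^(2M))^((H−1)/M) (consistent with the M = 1 case given above), the lag-one correlation of consecutive averages is obtained from the climacogram, and the distribution of the maximum of two consecutive variables uses the approximation of Appendix 5-II. Names are illustrative.

```python
import math

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def climacogram(k, lam=1.0, alpha=1.0, H=0.8, M=1.0):
    # Assumed FHK-C form; equation (3.87) is not reproduced in this section.
    return lam * (1.0 + (k / alpha) ** (2.0 * M)) ** ((H - 1.0) / M)

def lag1_correlation(k, **kw):
    # Correlation r of two consecutive averages at scale k, from
    # var[(x1 + x2)/2] = gamma(2k) = gamma(k) (1 + r) / 2.
    return 2.0 * climacogram(2.0 * k, **kw) / climacogram(k, **kw) - 1.0

def return_periods(x, k, **kw):
    """Return (T, T_distinct) at time scale k for a threshold x >= 0,
    for a zero-mean normal process."""
    z = x / math.sqrt(climacogram(k, **kw))        # standardized threshold
    r = lag1_correlation(k, **kw)
    m = 2.0 * math.sqrt((1.0 - r) / (17.0 + r))
    s = math.sqrt((17.0 + r) / (1.0 + r)) / 3.0
    exceed = norm_cdf(-z)                          # 1 - F(x)
    # F(x) - F_y(x) from the approximation of the maximum of two variables
    diff = exceed - math.exp(0.5 * m * m) * norm_cdf(-s * z - m) / s
    return k / exceed, k / diff

for M in (1.0, 0.5, 0.1):
    for k in (0.01, 0.1, 1.0, 10.0):
        T, T_distinct = return_periods(2.0, k, M=M)
        print(M, k, round(T, 3), round(T_distinct, 3))
```

Under these assumptions, for the fully smooth case the return period of a fixed threshold shrinks in proportion to k at small scales, whereas the distinct return period stays nearly constant.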

Both figures suggest that there is an “optimal” time scale for definition of return period, 
at which the difference between $T^{(k)}$ and $\bar{T}^{(k)}$ is minimum. This optimal time scale is of the order 
of magnitude of (but not exactly equal to) the model parameter α. If we specify the design 
return period at this scale, then the choice between $T^{(k)}$ and $\bar{T}^{(k)}$ makes no substantial 
difference. But far from this scale, misspecification of the variant of return period that is 
used may have a dramatic effect. In particular, it is totally inappropriate to specify a design 
return period for a time scale much larger (say by an order of magnitude) than the optimal, 
as in this case it will not represent anything related to the project safety but just the lower 
limit $T^{(k)} = 2k$. 


5.6 Sample estimation of return period 


In an observed sample of size n, the ith smallest value is an estimate of the ith order 
statistic, $x_{(i:n)}$. What is the estimate of the return period of this value? 

This question is very important in studying extremes, and its answer is particularly 
useful for the highest values in the sample. From the analysis on order statistics (section 
4.12) we recall that the stochastic variable $u := F(x_{(i:n)})$ has beta distribution with 
parameters i and n − i + 1. From well-known results for the beta distribution, the 
expected value is: 

$$E[u] = E[F(x_{(i:n)})] = \frac{i}{n+1} \tag{5.54}$$

It is remarkable that this does not depend on the distribution function F( ). Then an 
estimate of the return period $T_{(i:n)} := T(x_{(i:n)})$ could be: 

$$\frac{T_{(i:n)}}{D} = \frac{1}{1 - E[F(x_{(i:n)})]} = \frac{n+1}{n+1-i} \tag{5.55}$$

Hence, the return period estimate for the highest value, the maximum, $T_{(n)} := T_{(n:n)}$, is 
given as: 

$$\frac{T_{(n)}}{D} = n + 1 \tag{5.56}$$
Figure 5.8 Illustration of the variation of the threshold value x and the return period $T^{(k)}$ vs. 
the time scale k for constant $\bar{T}^{(k)}$ = 100 (left column) and $\bar{T}^{(k)}$ = 1000 (right column) and for a 
fully smooth process (M = 1, upper row), a neutral process (M = 0.5, middle row) and a rough 
process (M = 0.1, lower row). In all cases the model is FHK-C with H = 0.8 and α = 1, and with 
standard normal distribution for the instantaneous process (λ = 1). Dotted lines represent the 
lowest feasible values of $T^{(k)}$ and $\bar{T}^{(k)}$, when the exceedance probability is 0.5 and x = 0. 

Figure 5.9 Illustration of the variation of the threshold value x and the return period $T^{(k)}$ vs. 
the time scale k for constant $\bar{T}^{(k)}$ = 100 (left column) and $\bar{T}^{(k)}$ = 1000 (right column) and for a 
neutral FHK-C process with standard normal distribution for the instantaneous process (λ = 1), M 
= 0.5, H = 0.8 and α = 0.1 (close to pure randomness; upper row), α = 1 (middle row) and α = 10 
(high short-range dependence; lower row). Dotted lines represent the lowest feasible values of 
$T^{(k)}$ and $\bar{T}^{(k)}$, when the exceedance probability is 0.5 and x = 0. 
Equation (5.55) constitutes the most widely known and most popular way of assigning 
return periods to sample values. It is known as the Weibull plotting position (Weibull, 
1939). However, it is not the earliest, as Hazen (1914) had proposed a different formula, 
which looks similar; yet, as far as the maximum observation is concerned, the difference in 
the assigned return period is dramatic, at a ratio of 2:1. Several other formulae were 
proposed in the 20th century; these are listed in Table 5.4, along with their basic 
characteristics. 


Table 5.4 Alternative formulae of plotting positions (in chronological order). 

Name                T(i:n)/D                     T(n)/D          Comments 
Hazen (1914)        n/(n + 1/2 - i)              2n              Empirical 
Weibull (1939)      (n + 1)/(n + 1 - i)          n + 1           Distribution free, unbiased for F(x(i:n)) 
Blom (1958)         (n + 1/4)/(n + 5/8 - i)      (8/5)n + 2/5    Approximately unbiased quantiles for normal distribution 
Tukey (1962)        (n + 1/3)/(n + 2/3 - i)      (3/2)n + 1/2    Distribution free, preserving median of F(x(i:n)) (see text) 
Gringorten (1963)   (n + 0.12)/(n + 0.56 - i)    1.79n + 0.21    Approximately unbiased quantiles for EV1 distribution 
Cunnane (1978)      (n + 1/5)/(n + 3/5 - i)      (5/3)n + 1/3    Compromise for approximately unbiased quantiles for many distributions 





All these relationships are of the form: 

$$\frac{T_{(i:n)}}{D} = \frac{n + B}{n - i + A} \tag{5.57}$$

Also, all have a symmetry property. Namely, for the central element i = m + 1 of a sample 
with size n = 2m + 1 they yield a return period T/D = 2. In that case, (5.57) yields 
2 = (2m + 1 + B)/(2m + 1 + A − m − 1), from which we find that the symmetry property 
requires: 

$$B = 2A - 1 \tag{5.58}$$

All formulae of Table 5.4 satisfy it. 
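
As a small illustration, the general form (5.57) with the symmetry condition (5.58) can be coded once and applied to all formulae of Table 5.4. This is a minimal sketch; the dictionary of A constants is read off the table and the function name is illustrative.

```python
# Constant A of each formula in Table 5.4; B follows from symmetry, B = 2A - 1
PLOTTING_POSITION_A = {
    "Hazen": 1 / 2,
    "Weibull": 1.0,
    "Blom": 5 / 8,
    "Tukey": 2 / 3,
    "Gringorten": 0.56,
    "Cunnane": 3 / 5,
}

def return_period(i, n, A, B=None):
    """T(i:n)/D = (n + B)/(n - i + A) for the ith smallest of n values."""
    if B is None:
        B = 2.0 * A - 1.0          # symmetry property (5.58)
    return (n + B) / (n - i + A)

n = 20
for name, A in PLOTTING_POSITION_A.items():
    # return period (in units of D) assigned to the sample maximum
    print(name, round(return_period(n, n, A), 2))
```

For n = 20 the maximum receives 40 time steps with the Hazen formula and 21 with the Weibull formula, which illustrates the roughly 2:1 ratio mentioned above.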

Which of these different formulae should we follow? If we are interested in the small 
and intermediate items of the observed sample, all formulae give about the same results. 
But if we are interested in the largest values and particularly the maximum, then the 
differences are dramatic and the question is crucial. A first element of the answer is that 
we should avoid the Weibull formula. Certainly, it has some advantages, such as its 
extreme simplicity, its theoretical justification (which was provided after it was 
introduced) and the fact that it is distribution free. On the other hand, considering its 
theoretical justification, it does not make much sense to seek an unbiased estimate of 
$F(x_{(i:n)})$ while we aim to assign a return period to a certain value. Indeed, the Weibull 
formula provides an unbiased estimate of $F(x_{(i:n)})$, as well as of the exceedance 
probability $1 - F(x_{(i:n)})$, but the estimate of the return period $T_{(i:n)}$ given by equation 
(5.55) is far from unbiased, because of the nonlinearity of the latter equation. 

The Tukey formula is also distribution free, and has the advantage that it preserves the 
median of $F(x_{(i:n)})$ and of any of its transformations, including $T_{(i:n)}$. The median of the beta 
distribution of the variable $u := F(x_{(i:n)})$ with parameters i and n − i + 1 is 
$I_{1/2}^{-1}(i, n - i + 1)$, where $I^{-1}(\,)$ is the inverse regularized beta function. A common approximation of that 
median is (i − 1/3)/(n + 1/3), while for i = n the median is $2^{-1/n}$. Based on these, we can 
write: 

$$\frac{T_{(i:n)}}{D} = \frac{1}{1 - \mathrm{median}[F(x_{(i:n)})]} \approx \begin{cases} \dfrac{n + 1/3}{n - i + 2/3}, & i < n \\[2ex] \dfrac{1}{1 - 2^{-1/n}}, & i = n \end{cases} \tag{5.59}$$

Even using the upper of the two equations also for i = n, the relative error is small, 
(3/2) ln 2 − 1 = 0.04 at most. We can adapt the constants in (5.59) according to (5.57)- 
(5.58) to make it simpler with a negligible additional error. Namely, if we replace the 
constant 2/3 in the denominator with ln 2 = 0.693 and calculate the constant in the 
numerator by (5.58), the result is the following equation, which is asymptotically exact for 
i = n: 

$$\frac{T_{(i:n)}}{D} = \frac{1}{1 - \mathrm{median}[F(x_{(i:n)})]} \approx \frac{n + 2\ln 2 - 1}{n - i + \ln 2} = \frac{n + 0.386}{n - i + 0.693} \tag{5.60}$$
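
The size of these errors for the sample maximum can be checked directly, since the exact median of F for i = n is 2^(−1/n). The following is a minimal sketch with illustrative function names.

```python
import math

def exact_median_T_max(n):
    """Exact median-preserving return period of the maximum: 1/(1 - 2**(-1/n))."""
    return 1.0 / (1.0 - 2.0 ** (-1.0 / n))

def tukey_T(i, n):
    """Tukey formula, (n + 1/3)/(n - i + 2/3)."""
    return (n + 1.0 / 3.0) / (n - i + 2.0 / 3.0)

def adapted_T(i, n):
    """Adapted formula (5.60), with A = ln 2 and B = 2 ln 2 - 1."""
    A = math.log(2.0)
    return (n + 2.0 * A - 1.0) / (n - i + A)

for n in (2, 5, 10, 100, 1000):
    exact = exact_median_T_max(n)
    print(n, round(exact, 3),
          round(tukey_T(n, n) / exact - 1.0, 4),     # relative error of Tukey
          round(adapted_T(n, n) / exact - 1.0, 4))   # relative error of (5.60)
```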


Next, we seek unbiased estimates for the return period raised to a power ξ, namely of 
the expectation of the stochastic variable: 

$$v := \left(\frac{T(x_{(i:n)})}{D}\right)^{\xi} = \left(\frac{1}{1 - F(x_{(i:n)})}\right)^{\xi} = \left(1 - F(x_{(i:n)})\right)^{-\xi} \tag{5.61}$$

The case ξ = −1 corresponds to the unbiased estimation of $F(x_{(i:n)})$, which is already 
discussed. The case ξ = 1 corresponds to the unbiased estimation of the return period. 
The case ξ = 0 corresponds to the unbiased estimation of the logarithm of return period. 
Thus, this general setting allows studying several cases at once for ξ varying in [−1, 1]. 

As $u := F(x_{(i:n)})$ has beta distribution with parameters i and n − i + 1, the same 
distribution will have the variable $1 - v^{-1/\xi}$. Thus, from (4.62) it follows that: 

$$F_v(v) = \frac{B_{1 - v^{-1/\xi}}(i, n - i + 1)}{B(i, n - i + 1)} \tag{5.62}$$

After tedious algebraic manipulations, which are omitted, the expected value of v is found 
to be: 

$$E[v] = \frac{\Gamma(n+1)\,\Gamma(n+1-i-\xi)}{\Gamma(n+1-\xi)\,\Gamma(n+1-i)} \tag{5.63}$$

Hence, for v = E[v], the estimate of the return period is 

$$\frac{T(x_{(i:n)})}{D} = \left(\frac{\Gamma(n+1)\,\Gamma(n+1-i-\xi)}{\Gamma(n+1-\xi)\,\Gamma(n+1-i)}\right)^{1/\xi} \tag{5.64}$$

For integer ξ this can be simplified. Specifically, for ξ = −1 it becomes: 

$$\frac{T(x_{(i:n)})}{D} = \frac{n+1}{n+1-i} \tag{5.65}$$

for ξ = 0: 

$$\frac{T(x_{(i:n)})}{D} = \frac{e^{H_n}}{e^{H_{n-i}}} \tag{5.66}$$

where $H_n := \sum_{j=1}^{n} 1/j$ is the nth harmonic number, and for ξ = 1: 

$$\frac{T(x_{(i:n)})}{D} = \frac{n}{n-i} \tag{5.67}$$


Illustration of results for other values of ξ in the interval [−1, 1] is provided in Figure 
5.10, where it is evident that assignment of return periods for the smallest half of the 
sample is indifferent to the choice of ξ but, as we approach the largest value, i = n, the 
effect of ξ becomes dramatic. In particular, the option ξ = 1 (unbiased T) is unable to assign 
a return period to the largest value. Following this option practically means that we have 
to discard the largest value and assign a return period nD to the second largest. The 
reason is that equation (5.67) diverges for i = n. The divergence does not make this 
option an appropriate choice, while for reasons explained above, the option ξ = −1 is also 
inappropriate. 
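
The general estimate and its special cases can be evaluated with log-gamma functions. The sketch below assumes that the general expression is the ratio of gamma functions Γ(n+1)Γ(n+1−i−ξ)/(Γ(n+1−ξ)Γ(n+1−i)) raised to 1/ξ, which follows from the beta distribution of F(x(i:n)); the function name is illustrative.

```python
import math

def harmonic(n):
    """nth harmonic number H_n (H_0 = 0)."""
    return sum(1.0 / j for j in range(1, n + 1))

def return_period_power(i, n, xi):
    """T(x(i:n))/D that is unbiased for (T/D)**xi; xi = 0 is the unbiased-logarithm case."""
    if xi == 0.0:
        return math.exp(harmonic(n) - harmonic(n - i))
    if n + 1 - i - xi <= 0.0:
        return math.inf            # the expectation diverges, e.g. xi = 1 and i = n
    log_ev = (math.lgamma(n + 1) + math.lgamma(n + 1 - i - xi)
              - math.lgamma(n + 1 - xi) - math.lgamma(n + 1 - i))
    return math.exp(log_ev / xi)

n = 10
for xi in (-1.0, -0.5, 0.0, 0.5, 1.0):
    # return periods assigned to the largest and second largest of n values
    print(xi, return_period_power(n, n, xi), return_period_power(n - 1, n, xi))
```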
Figure 5.10 Illustration of the variation with the tail index ξ of the return period assigned by 
equation (5.64) to the ith smallest value in a sample of size n for the indicated values of i and for 
n = 10, 100 and 1000 for the left, middle and right panel, respectively. 


In Figure 5.10 we may also see that the option ξ = 0, corresponding to an unbiased 
estimate of the logarithm of return period, is most promising and balanced and therefore 
we can proceed with this. Even though equation (5.66) (for ξ = 0) is easy to evaluate, 
we can also make an approximation in the form (5.57), which is more practical and 
accurate enough. To this aim, we proceed as follows. We consider the case i = n and from 
(5.57) we find 

$$\frac{T_{(n)}}{D} = \frac{n + B}{A} = \frac{n}{A} + \frac{B}{A} \tag{5.68}$$

where we have simplified the notation as $T_{(n)} := T_{(n:n)}$. Taking the limit as n → ∞ we 
obtain: 

$$\Lambda_\infty = \lim_{n \to \infty} \frac{1}{n\left(1 - F(x_{(n)})\right)} = \frac{1}{A} \tag{5.69}$$

In the other extreme case, n = 1, setting $x_{(1)} = \mu$ (the mean), we similarly find: 

$$\Lambda_1 = \frac{1}{1 - F(\mu)} = \frac{1 + B}{A} \tag{5.70}$$

The coefficients $\Lambda_\infty$ and $\Lambda_1$ will be better explained in section 6.14. For now, it suffices to 
determine A and B by solving the system of the last two equations, i.e., 

$$A = \frac{1}{\Lambda_\infty}, \qquad B = \frac{\Lambda_1}{\Lambda_\infty} - 1 \tag{5.71}$$

From (5.66), $\Lambda_1 = e = 2.718$ and $\Lambda_\infty = e^{\gamma} = 1.781$, where γ = 0.5772 is the Euler 
constant. Thus, the approximation sought is: 

$$\frac{T(x_{(i:n)})}{D} = \frac{n + e^{1-\gamma} - 1}{n - i + e^{-\gamma}} = \frac{n + 0.526}{n - i + 0.561} \tag{5.72}$$

As n → ∞, while (5.72) is exact for $T(x_{(n:n)})$, it underestimates the return period of the 
second largest value, $T(x_{(n-1:n)})$, with a relative error: 

$$\frac{e}{1 + e^{\gamma}} - 1 = -0.023 \tag{5.73}$$

Nonetheless, as i gets smaller, the error in determining $T(x_{(i:n)})$ becomes zero. 

We can also formulate a slightly better approximation, by distinguishing the equation 
for the maximum value from all others. That is: 

$$\frac{T(x_{(i:n)})}{D} = \begin{cases} e^{\gamma}(n - 1) + e = 1.781n + 0.94, & i = n \\[1ex] \dfrac{n - 1 + e^{1-\gamma}}{n - i - 1 + e^{1-\gamma}}, & i < n \end{cases} \tag{5.74}$$

where γ = 0.5772 is the Euler constant. This is asymptotically exact, as n → ∞, for both 
the largest and the second largest sample values. Notice that neither the accurate equation 
(5.66) nor the approximations (5.72)-(5.74) are symmetric; namely, for n = 2m + 1 and 
i = m + 1 they result in a return period slightly higher than 2D (the value demanded by 
symmetry), which tends to 2D as n → ∞. Consequently, the equation of symmetry (5.58) 
does not hold. 
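
The accuracy of the single-formula approximation with constants A = e^(−γ) and B = e^(1−γ) − 1 can be verified against the exact harmonic-number expression. This is a minimal sketch; function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exact_T(i, n):
    """Exact unbiased-logarithm estimate, exp(H_n - H_(n-i))."""
    return math.exp(sum(1.0 / j for j in range(n - i + 1, n + 1)))

def approx_T(i, n):
    """Approximation (n + B)/(n - i + A) with A = exp(-gamma), B = exp(1 - gamma) - 1."""
    A = math.exp(-EULER_GAMMA)
    B = math.exp(1.0 - EULER_GAMMA) - 1.0
    return (n + B) / (n - i + A)

n = 1000
for i in (n, n - 1, n - 10, n // 2, 1):
    # relative error of the approximation for selected ranks
    print(i, round(approx_T(i, n) / exact_T(i, n) - 1.0, 5))
```

For large n the relative error is close to zero for the maximum, about −0.023 for the second largest value, and vanishes again for lower ranks.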

With the same methodology, appropriately modified, we can determine unbiased 
estimates for other quantities, for example distribution quantiles. In fact, no adaptation is 
needed for the exponential and Pareto distribution as the variable v in equation (5.61) 
also represents the quantiles of those distributions. For the normal distribution we have 
$\Lambda_\infty = e^{\gamma} = 1.781$ (same as above for ξ = 0) and $\Lambda_1 = 2$ (due to symmetry). The resulting 
formula is contained in Table 5.5, along with a summary of all above results. 


Table 5.5 Suggested formulae of sample estimation of return period (plotting position). 

No.   T(i:n)/D                                  T(n)/D                        Preserved quantity 
I     (n + 2 ln 2 - 1)/(n - i + ln 2)           (n - 1)/ln 2 + 2              Median of F(x(i:n)), 1 - F(x(i:n)), T(i:n), 
      = (n + 0.386)/(n - i + 0.693)             = 1.443n + 0.56               of any distribution 
II    exp(H_n)/exp(H_(n-i))                     exp(H_n)                      Logarithm of T(i:n) of any distribution; 
                                                                              quantile x(i:n) of exponential distribution (Exact) 
III   (n + e^(1-γ) - 1)/(n - i + e^(-γ))        e^γ (n - 1) + e               Approximation of II; also valid for the quantile 
      = (n + 0.526)/(n - i + 0.561)             = 1.781n + 0.94               x(i:n) of positively skewed distributions of the 
                                                                              EV1 domain of attraction 
IV    (n + 2e^(-γ) - 1)/(n - i + e^(-γ))        e^γ (n - 1) + 2               Quantile x(i:n) of normal distribution; also valid 
      = (n + 0.123)/(n - i + 0.561)             = 1.781n + 0.22               for symmetrical distributions of the EV1 domain 
                                                                              of attraction 
V     [Γ(n+1) Γ(n+1-i-ξ) /                      (n B(n, 1 - ξ))^(1/ξ)         Power of T(i:n) to exponent ξ for any distribution; 
       (Γ(n+1-ξ) Γ(n+1-i))]^(1/ξ)                                             quantile x(i:n) of Pareto distribution (Exact) 
VI    (n + B)/(n - i + A), with                 (n + B)/A, e.g.,              Approximation of V; also valid for distributions 
      A = (Γ(1 - ξ))^(-1/ξ),                    ξ = 0.15: 2.035n + 0.92       of the EV2 domain of attraction with tail index ξ 
      B = (Γ(2 - ξ))^(-1/ξ) - 1                 ξ = 0.5: 3.142n + 0.86 





As a final suggestion following from the above detailed analysis, equation (5.74)—case 
III (or if theoretical rigour is sought, (5.66)—case II) is the most appropriate for practical 
use for any distribution function. It represents unbiasedness in the estimation of the 
logarithm of return period which is a balance between unbiasedness in distribution 
function and in return period. In the particular case of the exponential distribution, it also 
provides unbiased quantile estimation. Furthermore, as seen in Table 5.5, for all 
distributions of the domain of attraction of EV1, the return period of the largest value is 
1.781 times n (plus a constant), while in those of EV2 it is even higher. Having this in mind 
and observing the traditional formulae of Table 5.4 only the Hazen and Gringorten 
formulae satisfy this condition. Interestingly, the Gringorten formula is virtually the same 
as case IV of Table 5.5. Note though that case IV was derived here for the normal 
distribution and not for the EV1 distribution, for which we suggest here the slightly 
different case III of Table 5.5. Thus, if one wishes to use the traditional formulae, the 
Gringorten formula is preferable. We may also notice that the Hazen formula is not 
unjustified in light of the above analysis. Rather, it appears equivalent to case VI of Table 
5.5, which is for the Pareto distribution with ξ = 0.15. 
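
The constants of case VI can be evaluated for any tail index. The sketch below assumes A = Γ(1 − ξ)^(−1/ξ) and B = Γ(2 − ξ)^(−1/ξ) − 1, which is what the limiting cases n → ∞ and n = 1 of the exact Pareto formula (case V) give; the function names are illustrative.

```python
import math

def case_vi_constants(xi):
    """Constants (A, B) of the form (n + B)/(n - i + A) for tail index 0 < xi < 1."""
    A = math.gamma(1.0 - xi) ** (-1.0 / xi)
    B = math.gamma(2.0 - xi) ** (-1.0 / xi) - 1.0
    return A, B

def T_max(n, xi):
    """Return period (in units of D) assigned to the sample maximum."""
    A, B = case_vi_constants(xi)
    return (n + B) / A

for xi in (0.15, 0.5):
    A, B = case_vi_constants(xi)
    # slope and intercept of T(n)/D as a linear function of n
    print(xi, round(1.0 / A, 3), round(B / A, 2))
```

For ξ = 0.15 the slope is close to 2, in line with the remark on the Hazen formula, and for ξ = 0.5 the slope equals π.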

The fact that case III, as an unbiased estimator of the logarithm of return period, is 
distribution free is perhaps its most significant advantage. Usually, in practice we need to 
assign return periods to sample values prior to choosing a model and, in this respect, case 
III is an advantageous choice. Once we choose a model, we may reconsider the assignment 
of return periods using more accurate methods and formulae, which are based on K- 
moments that are discussed in Chapter 6 and particularly section 6.14. The latter analyses 
also consider other effects, such as the possible existence of time dependence, which 
influences the estimation of return periods. 


Digression 5.D: Illustration of the range of sample estimates of return 
period 


A stochastic simulation always helps to develop better intuition of theoretical concepts and spot 
possible errors. Here we perform a simulation exercise generating m = 10 000 random samples, 
each consisting of n = 10 values from the Pareto distribution with tail index ξ = 0.5. We 
intentionally choose a small n and a large ξ (note that for this ξ the variance of the variable 
diverges to infinity) for better illustration. For the chosen Pareto distribution of the parent 
stochastic variable x we have: 

$$\frac{T(x)}{D} = \frac{1}{1 - F(x)} = (1 + \xi x)^{1/\xi}$$

The theoretical return period of the ith order statistic $x_{(i:n)}$ is determined from equation (4.62), 
i.e., 

$$\frac{T_{(i:n)}(x)}{D} = \frac{B(i, n - i + 1)}{B(i, n - i + 1) - B_{F(x)}(i, n - i + 1)}$$

For the special cases of the minimum and maximum order statistic we have, respectively, 

$$\frac{T_{(1:n)}(x)}{D} = \left(\frac{T(x)}{D}\right)^{n}, \qquad 1 - \frac{D}{T_{(n:n)}(x)} = \left(1 - \frac{D}{T(x)}\right)^{n}$$

where in our case n = 10. 

The theoretical curves T(x) for the parent variable and for the largest and second largest order 
statistics, $x_{(10)} := x_{(10:10)}$ and $x_{(9:10)}$, have been plotted in the left panel of Figure 5.11. An 
interesting observation is that the curve of $x_{(9:10)}$ crosses that of x but not that of $x_{(10:10)}$. Actually, 
this happens at all i = 2, ..., n − 1 and for large n the intersection point corresponds to: 


Tim) n-1 
D C= 1. 
For each of the 10 000 samples the following values were evaluated 

$$x_{(10)}, \quad F(x_{(10)}), \quad T(x_{(10)}), \quad \ln T(x_{(10)})$$

and the same list for $x_{(9:10)}$ too. The empirical (sample) distributions of x, $x_{(10)}$ and $x_{(9:10)}$ are also 
plotted in the left panel of Figure 5.11 in the form of return period plots. They were estimated 
from equation (5.72) (option III of Table 5.5). Generally, the empirical plots show good agreement 
with the theoretical ones, thus indicating the consistency of the proposed framework. 

Next the averages and the medians from all m = 10 000 values of the above list of variables 
were calculated. For $x_{(10)}$ the averages and the medians are plotted in the right panel of Figure 
5.11, over the curve T(x) of the parent variable. In all cases these empirical estimates fully agree 
(the points coincide) with the theoretical estimates, with one exception: the theoretical average 
of $T(x_{(10)})$ is ∞, while the sample estimate is necessarily a finite value, yet a very large one 
(>100D). Generally, the plot shows that the differences between the different options for 
assigning return period to the largest value (represented by the different points) can be 
substantial. The points corresponding to $\ln T(x_{(10)})$ and $x_{(10)}$ look the most balanced choices, 
thus confirming what has already been discussed. 

We must note that the differences would be less dramatic if the sample size was greater or if 
the distribution had a smaller tail index. 
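
A simulation along these lines can be sketched as follows. The sketch assumes a unit-scale Pareto distribution, x = (u^(−ξ) − 1)/ξ with u a uniform exceedance probability, and expresses each average as the return period it implies; function and key names are illustrative and the random seed is arbitrary.

```python
import math
import random
import statistics

def simulate_maximum(n=10, m=10_000, xi=0.5, seed=1):
    """Simulate m sample maxima of n Pareto values and return the return
    periods (in units of D) implied by averaging different variables."""
    rng = random.Random(seed)
    x, F, T, lnT = [], [], [], []
    for _ in range(m):
        # exceedance probability of the sample maximum = smallest of n uniforms in (0, 1]
        u = min(1.0 - rng.random() for _ in range(n))
        x.append((u ** (-xi) - 1.0) / xi)   # Pareto quantile of the maximum
        F.append(1.0 - u)
        T.append(1.0 / u)
        lnT.append(-math.log(u))
    mean = statistics.fmean
    return {
        "T_of_mean_x": (1.0 + xi * mean(x)) ** (1.0 / xi),
        "T_of_mean_F": 1.0 / (1.0 - mean(F)),
        "mean_T": mean(T),
        "T_of_mean_lnT": math.exp(mean(lnT)),
        "T_of_median": statistics.median(T),
    }

for key, value in simulate_maximum().items():
    print(key, round(value, 2))
```

By the inequalities between arithmetic, geometric and power means, the return periods implied by the averages of F, ln T, x and T are ordered from smallest to largest in every simulated sample set, which mirrors the spread of the points described above.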
Figure 5.11 Simulation results from m = 10 000 random samples, each consisting of n = 10 values, from the 
Pareto distribution with tail index ξ = 0.5: (left) Theoretical and empirical (sample) distributions of the 
indicated variables, where theoretical distributions were determined from equation (4.62) and empirical 
distributions from (5.72) (option III of Table 5.5). (right) Simulated averages from the m = 10 000 values 
of the indicated variables. The median, also plotted, is indifferent to the choice of the variable. 


Digression 5.E: A fun way to calculate π through the properties of maxima 


The curious reader may have noticed that the formula given in Table 5.5 (case VI) for the return 
period of the maximum value of the Pareto distribution with ξ = 0.5 contains the value of π. More 
rigorously, the formula in this case is 

$$T_{(n)}/D = \pi(n - 1) + 4$$

For n = 2, $T_{(2)}/D$ becomes π + 4 (rather than 3, a value that would be assigned if we adopted the 
Weibull plotting position formula). Solving for π we find 

$$\pi = \frac{T_{(n)}/D - 4}{n - 1}$$

This enables a Monte Carlo technique to calculate π. Interestingly, ideas that could be classified 
as implementation of the Monte Carlo method to calculate π are much older than the formal Monte 
Carlo method. Georges Louis LeClerc (Comte de Buffon, French scientist; 1707-1788) became 
famous for “Buffon’s needle,” a method using needle tosses onto a lined background to estimate 
π (where, if the line distance is equal to needle length, π is found as twice the inverse of probability 
that the needle crosses a line). LeClerc’s method became popular among scientists and his 
experiment was later repeated by many. However, his method and many other Monte Carlo 
algorithms to estimate π, including the one presented here, are good only for fun. Much faster and 
much more accurate deterministic algorithms exist to calculate π. Reitwiesner (1950) calculated 
by a deterministic algorithm, running on the ENIAC computer, the first 2035 decimal digits of π. 
Metropolis et al. (1950) examined their randomness, an exercise made thereafter many times 
showing that the digits of π have no apparent pattern and pass tests for statistical randomness. 
Dodge (1996) promoted an idea opposite to LeClerc’s: that the digits of π form a “Natural Random 
Number Generator”. Since January 2019, 31.4 trillion digits of π are known (found by the 
Chudnovsky algorithm); this information, equivalent to ~100 million books of 1000 pages each 
(note for comparison that the British Library has 25 million books), can serve as a basis for any 
simulation experiment. However, the simple random generators discussed in section 2.6 are more 
economic and convenient. 

After this historical note, we return to our method to estimate π from the maxima of the Pareto 
distribution. The equation above allows formulating the following algorithm. 


1. We generate n random numbers from the Pareto distribution with ξ = 0.5 and take the 
maximum. 

2. We repeat this procedure m times and calculate the average of all m maxima. 

3. We find the return period of this average from the theoretical relationship of the Pareto 
distribution and calculate π from the above equation. 


For step 1 we note that a random number from the Pareto distribution with ξ = 0.5 is 
generated from 

$$x = \frac{u^{-\xi} - 1}{\xi} = 2\left(\frac{1}{\sqrt{u}} - 1\right)$$

where u is a random number from the uniform distribution. The maximum of n random numbers 
is thus 

$$x_{(n)} = \max(x_1, \ldots, x_n) = 2\left(\frac{1}{\sqrt{\min(u_1, \ldots, u_n)}} - 1\right)$$

Figure 5.12 Simulation results to estimate π from m = 100 000 random samples, each consisting of n 
random numbers from the Pareto distribution with tail index ξ = 0.5. The curves depict the sampling 
distributions of π estimates for the indicated three values of n. 


For step 2, the average of m simulated $x_{(n)}$, each denoted as $x_{(n)}^i$, i = 1, ..., m, is 

$$\bar{x}_{(n)} = 2\left(\frac{1}{m}\sum_{i=1}^{m} \frac{1}{\sqrt{\min(u_1^i, \ldots, u_n^i)}} - 1\right)$$

For step 3, the return period of the average is 

$$\frac{T}{D} = \left(1 + \frac{\bar{x}_{(n)}}{2}\right)^{2} = \left(\frac{1}{m}\sum_{i=1}^{m} \frac{1}{\sqrt{\min(u_1^i, \ldots, u_n^i)}}\right)^{2}$$

and the simulated value of π is 

$$\hat{\pi} = \frac{1}{n - 1}\left[\left(\frac{1}{m}\sum_{i=1}^{m} \frac{1}{\sqrt{\min(u_1^i, \ldots, u_n^i)}}\right)^{2} - 4\right]$$

Obviously, the larger the values of m and n, the better the estimate. Even for the minimum 
possible value, n = 2, the result is not bad, as shown in Figure 5.12. The three curves for n = 2, 3 
and 5 shown in the figure intersect at π = 3.14. 
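
The three steps of the algorithm can be sketched as below. The sketch assumes the unit-scale Pareto generator x = 2(1/√u − 1) and uses Python's standard generator in place of those of section 2.6; function names and the seed are illustrative. A second function gives the value to which the procedure converges as m grows, computed from the exact Pareto expression (n B(n, 1/2))² of case V; since the π formula comes from the approximate case VI, that value differs slightly from π for small n (about 3.11 for n = 2) and approaches π as n increases.

```python
import math
import random

def simulate_pi(n, m, seed=0):
    """Monte Carlo estimate of pi from maxima of n Pareto (xi = 0.5) values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        # step 1: maximum of n Pareto values <-> minimum of n uniforms in (0, 1]
        u_min = min(1.0 - rng.random() for _ in range(n))
        total += 1.0 / math.sqrt(u_min)
    # steps 2-3: return period of the average maximum, T/D = (1 + x_bar/2)**2
    T = (total / m) ** 2
    return (T - 4.0) / (n - 1)

def limiting_estimate(n):
    """Value of the estimate as m -> infinity, from T/D = (n B(n, 1/2))**2."""
    log_beta = math.lgamma(n) + math.lgamma(0.5) - math.lgamma(n + 0.5)
    T = (n * math.exp(log_beta)) ** 2
    return (T - 4.0) / (n - 1)

for n in (2, 3, 5, 50):
    print(n, round(simulate_pi(n, 100_000), 3), round(limiting_estimate(n), 4))
```

Because the variance of the simulated maxima is infinite for ξ = 0.5, the Monte Carlo estimate converges slowly and individual runs can deviate noticeably.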




Appendix 5-I: Approximation of the normal distribution for inferring the 
behaviour of its extremes 


The density of the standard normal distribution can be written as: 

$$f_N(x) = \exp\left(-\ln\sqrt{2\pi} - x^2/2\right) \tag{5.75}$$

By numerical investigation it is seen that an approximation of its distribution function is: 

$$F_N(x) = 1 - \int_x^{\infty} f_N(y)\,dy \approx 1 - g(x; c_0, c_1, c_2) \tag{5.76}$$

where $c_0, c_1, c_2$ are numerical constants and the function $g(x; a_0, a_1, a_2)$, for any $a_0, a_1, a_2$, is 
defined as: 

$$g(x; a_0, a_1, a_2) := \begin{cases} \exp\left(-(a_0 + a_1 x + a_2 x^2)\right), & x \ge 0 \\ 1 - g(-x; a_0, a_1, a_2), & x < 0 \end{cases} \tag{5.77}$$

An interesting property is that the function g( ) is preserved under multiplication for x ≥ 0: 

$$g(x; a_0, a_1, a_2)\, g(cx; b_0, b_1, b_2) = g(x; a_0 + b_0, a_1 + c\,b_1, a_2 + c^2 b_2) \tag{5.78}$$

Notice that the density function $f_N(x)$ can itself be written, for x ≥ 0, as 

$$f_N(x) = g(x; \ln\sqrt{2\pi}, 0, 1/2) \tag{5.79}$$

Now for x = 0, $F_N(x) = 1/2$, so that $g(0; c_0, c_1, c_2) = 1/2$ and in order for this to hold we must set 
$\exp(-c_0) = 1/2$ or $c_0 = \ln 2$. The constants $c_1$ and $c_2$ can be determined by minimizing the fitting 
error. Several combinations can provide a good fitting; here we preferred the following 
combination, easy to remember: 

$$c_0 = \ln 2, \qquad c_1 = \frac{2}{3}, \qquad c_2 = \left(\frac{2}{3}\right)^2 = \frac{4}{9} \tag{5.80}$$

This yields equation (5.44). 
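
The quality of this approximation can be checked with a few lines of code. The sketch below assumes the constants (ln 2, 2/3, 4/9) and the definition of g( ) as an exponential of a negated quadratic for x ≥ 0, extended by 1 − g(−x) for x < 0; function names are illustrative.

```python
import math

C0, C1, C2 = math.log(2.0), 2.0 / 3.0, 4.0 / 9.0

def g(x, a0, a1, a2):
    """g(x; a0, a1, a2) = exp(-(a0 + a1 x + a2 x^2)) for x >= 0, else 1 - g(-x; ...)."""
    if x >= 0.0:
        return math.exp(-(a0 + a1 * x + a2 * x * x))
    return 1.0 - g(-x, a0, a1, a2)

def norm_cdf_approx(x):
    """Approximate standard normal distribution function, 1 - g(x; c0, c1, c2)."""
    return 1.0 - g(x, C0, C1, C2)

def norm_cdf(x):
    """Exact standard normal distribution function via the error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

grid = [j / 100.0 for j in range(-600, 601)]
max_err = max(abs(norm_cdf_approx(x) - norm_cdf(x)) for x in grid)
print(max_err)  # largest absolute error on the grid
```

On this grid the largest absolute error is of the order of 0.01 and occurs at moderate values of |x|; in the tails the approximate and exact exceedance probabilities are close to each other.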
We now determine integrals of the function g( ), which are useful in other calculations. We 
consider a linear transformation of x, w = sx + m, w > 0. Using calculus of probability, we find: 

$$F_N(sx + m) = 1 - \int_x^{\infty} s f_N(sw + m)\,dw \tag{5.81}$$

$$s f_N(sw + m) = \exp\left(\ln s - \tfrac{1}{2}\ln(2\pi) - \frac{(sw + m)^2}{2}\right) = \exp\left(\ln s - \tfrac{1}{2}\ln(2\pi) - \frac{m^2}{2} - msw - \frac{s^2 w^2}{2}\right) \tag{5.82}$$

For appropriate $a_0, a_1, a_2$ and c, namely those satisfying: 

$$s = \sqrt{2a_2}, \qquad m = \frac{a_1}{\sqrt{2a_2}}, \qquad \ln c = \ln\sqrt{\frac{a_2}{\pi}} + a_0 - \frac{a_1^2}{4a_2} \tag{5.83}$$

equation (5.82) can be written as 

$$s f_N(sw + m) = c\, g(w; a_0, a_1, a_2) = \exp\left(\ln c - (a_0 + a_1 w + a_2 w^2)\right) \tag{5.84}$$

Combining the above equations, we have: 

$$F_N(sx + m) = 1 - c \int_x^{\infty} g(w; a_0, a_1, a_2)\,dw \tag{5.85}$$

and solving for the integral we find 

$$\int_x^{\infty} g(w; a_0, a_1, a_2)\,dw = \frac{1 - F_N\left(\sqrt{2a_2}\,x + a_1/\sqrt{2a_2}\right)}{\sqrt{a_2/\pi}\,\exp\left(a_0 - a_1^2/4a_2\right)} \tag{5.86}$$

which is exact and valid for x ≥ 0. If we approximate $F_N$ using g( ) as in (5.76), then after algebraic 
manipulations we find: 

$$\int_x^{\infty} g(w; a_0, a_1, a_2)\,dw \approx g\left(x;\ \ln\left(2\sqrt{\frac{a_2}{\pi}}\right) + a_0 + \frac{\sqrt{2}\,a_1}{3\sqrt{a_2}} - \frac{a_1^2}{36a_2},\ \frac{2\sqrt{2a_2}}{3} + \frac{8a_1}{9},\ \frac{8a_2}{9}\right) \tag{5.87}$$

It can be verified that if $(a_0, a_1, a_2) = (\ln\sqrt{2\pi}, 0, 1/2)$ as in (5.79), then the corresponding 
coefficients in the right-hand side of (5.87) become (ln 2, 2/3, 4/9) as in (5.80). 


Appendix 5-II: Approximation of the distribution of extremes of two 
correlated normal variables 


We assume that $(x_1, x_2)$ have standard normal distribution and are dependent with correlation 
coefficient r. We define: 

$$y := \max(x_1, x_2), \qquad z := \min(x_1, x_2) \tag{5.88}$$

The exact probability densities are (Nadarajah and Kotz, 2008): 

$$f_y(y) = 2 f_N(y)\, F_N\left(\sqrt{\frac{1-r}{1+r}}\; y\right), \qquad f_z(z) = 2 f_N(z)\, F_N\left(-\sqrt{\frac{1-r}{1+r}}\; z\right) = 2 f_N(z) - f_y(z) \tag{5.89}$$

Using calculus of probability, we find: 

$$E[y] = \sqrt{\frac{1-r}{\pi}}, \qquad E[z] = -\sqrt{\frac{1-r}{\pi}}$$


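However, there is no analytical solution for the exact distribution function (the integral of $f_y(y)$) 
and we seek an approximation. For y ≥ 0 the distribution function is: 

$$F_y(y) = 1 - \int_y^{\infty} 2 f_N(w)\, F_N\left(\sqrt{\frac{1-r}{1+r}}\; w\right) dw \tag{5.90}$$

or 

$$\begin{aligned} F_y(y) &= 1 - 2\int_y^{\infty} f_N(w)\,dw + 2\int_y^{\infty} f_N(w)\left[1 - F_N\left(\sqrt{\frac{1-r}{1+r}}\; w\right)\right] dw \\ &= 2F_N(y) - 1 + 2\int_y^{\infty} f_N(w)\left[1 - F_N\left(\sqrt{\frac{1-r}{1+r}}\; w\right)\right] dw \end{aligned} \tag{5.91}$$

Using the approximation (5.76)-(5.77), the property (5.78) and the integral (5.86), the last integral 
in (5.91) becomes: 

$$\begin{aligned} \int_y^{\infty} f_N(w)\left[1 - F_N\left(\sqrt{\frac{1-r}{1+r}}\; w\right)\right] dw &\approx \int_y^{\infty} g\left(w; \ln\sqrt{2\pi}, 0, \frac{1}{2}\right) g\left(\sqrt{\frac{1-r}{1+r}}\; w; \ln 2, \frac{2}{3}, \frac{4}{9}\right) dw \\ &= \int_y^{\infty} g\left(w; \ln\left(2\sqrt{2\pi}\right), \frac{2}{3}\sqrt{\frac{1-r}{1+r}}, \frac{17+r}{18(1+r)}\right) dw \\ &= \frac{3}{2}\sqrt{\frac{1+r}{17+r}}\, \exp\left(\frac{2(1-r)}{17+r}\right) \left[1 - F_N\left(\frac{1}{3}\sqrt{\frac{17+r}{1+r}}\; y + 2\sqrt{\frac{1-r}{17+r}}\right)\right] \end{aligned} \tag{5.92}$$

Consequently, combining (5.91) and (5.92), we find 

$$F_y(y) \approx 2F_N(y) - 1 + 3\sqrt{\frac{1+r}{17+r}}\, \exp\left(\frac{2(1-r)}{17+r}\right) \left[1 - F_N\left(\frac{1}{3}\sqrt{\frac{17+r}{1+r}}\; y + 2\sqrt{\frac{1-r}{17+r}}\right)\right] \tag{5.93}$$

By setting 

$$s := \frac{1}{3}\sqrt{\frac{17+r}{1+r}}, \qquad m := 2\sqrt{\frac{1-r}{17+r}} \tag{5.94}$$

equation (5.93) can be written as 

$$F_y(y) \approx 2F_N(y) - 1 + \exp\left(\frac{m^2}{2}\right)\frac{1 - F_N(sy + m)}{s} = 2F_N(y) - 1 + \exp\left(\frac{m^2}{2}\right)\frac{F_N(-sy - m)}{s} \tag{5.95}$$

Note that $0 \le m \le 1/\sqrt{2}$ and $s \ge 1$. 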
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Likewise, for y < 0: 
y 


1-r 
RO) = | 2hvor)F rad dw (5.96) 


—0o 


The integral now is: 





y y 
| ford dw = [ o(—w InV2t,0,1/2)g (- dw (5.97) 


and hence: 


“ 
y 
Lr 2 /1-r41-r 1 
| ford tae dw = | s Ww; Invén,—3 5 tS dw 
-y 


2 -r41-r 1 
D\ WBN STs sae 91+r s 
(5.98) 
=e mee F ue 17 +17 _2 1-r 
T+r fone N\3J14+r ” I7+r 
2 fiv+r a 2 fivt+r (Ss) 
3 1+r -P 17 +r Sal toe I7+r 
m?\ Fy(sy — m) 
meeps =e 


























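Summarizing, the distribution function of $y := \max(x_1, x_2)$ is:

\[ F_y(y) \approx \begin{cases} 2F_N(y) - 1 + \exp\!\left(\dfrac{m^2}{2}\right)\dfrac{F_N(-sy - m)}{s}, & y \ge 0 \\[2ex] \exp\!\left(\dfrac{m^2}{2}\right)\dfrac{F_N(sy - m)}{s}, & y \le 0 \end{cases} \tag{5.99} \]

For r = 0, for which $m = 2/\sqrt{17}$, $s = \sqrt{17}/3$, the result is:

\[ F_y(y) \approx \begin{cases} 2F_N(y) - 1 + \dfrac{3}{\sqrt{17}}\exp\!\left(\dfrac{2}{17}\right) F_N\!\left(-\sqrt{17}\,y/3 - 2/\sqrt{17}\right), & y \ge 0 \\[2ex] \dfrac{3}{\sqrt{17}}\exp\!\left(\dfrac{2}{17}\right) F_N\!\left(\sqrt{17}\,y/3 - 2/\sqrt{17}\right), & y \le 0 \end{cases} \tag{5.100} \]

which is very close to $(F_N(y))^2$.

The difference $F_N(y) - F_y(y)$ is symmetric about y = 0, given by:

\[ F_N(y) - F_y(y) \approx F_N(-|y|) - \exp\!\left(\frac{m^2}{2}\right)\frac{F_N(-s|y| - m)}{s} \tag{5.101} \]

and at y = 0 it takes its maximum value, which is:

\[ F_N(0) - F_y(0) \approx \frac{1}{2} - \exp\!\left(\frac{m^2}{2}\right)\frac{F_N(-m)}{s} \tag{5.102} \]

For completeness, the distribution function of $z := \min(x_1, x_2)$ is calculated in a similar manner and is found to be:

\[ F_z(z) = 2F_N(z) - F_y(z) \approx \begin{cases} 1 - \exp\!\left(\dfrac{m^2}{2}\right)\dfrac{F_N(-sz - m)}{s}, & z \ge 0 \\[2ex] 2F_N(z) - \exp\!\left(\dfrac{m^2}{2}\right)\dfrac{F_N(sz - m)}{s}, & z \le 0 \end{cases} \tag{5.103} \]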
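As a numerical illustration of the approximation (5.99), the following Python sketch evaluates it with the s and m of (5.94) and compares it, for r = 0, with the exact distribution function $(F_N(y))^2$ of the maximum of two independent standard normal variables. The function names are ours and are not part of the original derivation; only the standard library is assumed.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function F_N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_cdf_approx(y, r):
    """Approximate distribution function of max(x1, x2), equation (5.99)."""
    s = math.sqrt((17.0 + r) / (1.0 + r)) / 3.0   # equation (5.94)
    m = 2.0 * math.sqrt((1.0 - r) / (17.0 + r))   # equation (5.94)
    c = math.exp(m * m / 2.0) / s
    if y >= 0:
        return 2.0 * std_normal_cdf(y) - 1.0 + c * std_normal_cdf(-s * y - m)
    return c * std_normal_cdf(s * y - m)


def min_cdf_approx(z, r):
    """Approximate distribution function of min(x1, x2), equation (5.103)."""
    return 2.0 * std_normal_cdf(z) - max_cdf_approx(z, r)


if __name__ == "__main__":
    # Independent case (r = 0): the exact result is F_N(y)^2.
    for y in (-2.0, -1.0, 0.0, 1.0, 2.0):
        exact = std_normal_cdf(y) ** 2
        print(f"y = {y:5.1f}  approx = {max_cdf_approx(y, 0.0):.4f}  exact = {exact:.4f}")
```

For r = 1 the two variables coincide, so that s = 1 and m = 0, and the sketch returns $F_N(y)$ itself, which offers a convenient sanity check.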


Chapter 6. Knowable moments and their relationship to extremes 


6.1 Rationale and definitions 


In Chapter 4 we have explained (and illustrated in Digression 4.B) that classical moments 
beyond order 2 or 3 are unknowable and their estimation from data is not feasible. In 
Chapter 6 we study a new type of moments, the knowable moments (K-moments) which 
can be reliably estimated for high orders and are useful in analyses of extremes. 

Let x be a stochastic variable and $x_1, x_2, \ldots, x_p$ be IID copies of it, forming a sample. The maximum of all, which is the largest (pth) order statistic, is by definition:

\[ x_{(p)} := \max(x_1, x_2, \ldots, x_p) \tag{6.1} \]

It is readily obtained that if F(x) is the distribution function of x and f(x) its probability density function, then those of $x_{(p)}$ are:

\[ F^{(p)}(x) = (F(x))^p, \qquad f^{(p)}(x) = p f(x)\,(F(x))^{p-1} \tag{6.2} \]

where the former is the product of p instances of F(x) (justified by the IID assumption) while the latter is none other than the derivative of $F^{(p)}(x)$ with respect to x. The expected maximum of order p of x, i.e., the expected value of $x_{(p)}$, is therefore

\[ \mathrm{E}[x_{(p)}] = \mathrm{E}[\max(x_1, x_2, \ldots, x_p)] = p\,\mathrm{E}\!\left[(F(x))^{p-1} x\right] \tag{6.3} \]

Likewise, the expected minimum of the p variables is:

\[ \mathrm{E}[\min(x_1, x_2, \ldots, x_p)] = p\,\mathrm{E}\!\left[(1 - F(x))^{p-1} x\right] = p\,\mathrm{E}\!\left[(\bar{F}(x))^{p-1} x\right] \tag{6.4} \]

We must stress that the variables $x_1, x_2, \ldots, x_p$ we consider here are not meant in succession in time and in this respect do not form a stochastic process, but are regarded as an ensemble of copies of x. In other words, the possible dependence in time in a stochastic process is not considered up to now (but will be considered later starting from its impacts on estimation; see section 6.9).

The expected value in (6.3) defines a statistical moment, which, following Koutsoyiannis (2019a), we call noncentral knowable moment of order p:

\[ K'_p := p\,\mathrm{E}\!\left[(F(x))^{p-1} x\right], \qquad p \ge 1 \tag{6.5} \]


The meaning of the term knowable will be discussed later, in section 6.4. By generalizing (6.5), we define the following variants of knowable moment of orders (p, q), where all definitions are valid for p ≥ q.

• Noncentral:

\[ K'_{pq} := (p - q + 1)\,\mathrm{E}\!\left[(F(x))^{p-q} x^q\right] \tag{6.6} \]

• Central:

\[ K_{pq} := (p - q + 1)\,\mathrm{E}\!\left[(F(x))^{p-q} (x - \mu)^q\right] \tag{6.7} \]

where μ is the mean of x, i.e. $\mu := \mathrm{E}[x] = K'_{11}$.

• Tail-based (noncentral):

\[ \bar{K}'_{pq} := (p - q + 1)\,\mathrm{E}\!\left[(1 - F(x))^{p-q} x^q\right] = (p - q + 1)\,\mathrm{E}\!\left[(\bar{F}(x))^{p-q} x^q\right] \tag{6.8} \]

• Hypercentral:

\[ K^+_{pq} := (p - q + 1)\,\mathrm{E}\!\left[(2F(x) - 1)^{p-q} (x - \mu)^q\right], \qquad p \ge q \tag{6.9} \]

For brevity, we will also refer to the knowable moments as K-moments. As evident from the above notation, for q = 1, $K'_p \equiv K'_{p1}$ (and likewise for the other variants). The K-moments were introduced in Koutsoyiannis (2019a), even though, according to the current notation, the central and the tail-based moments were not included and the hypercentral moments were called central. The current version of the central moments' definition has some advantages with respect to the transformations by shift of origin of the variable (like x − c; see below). Nonetheless, all K-moment categories are related to each other and the relationships are given in Appendix 6-II. Some characteristic relationships or values for specific q or p are summarized in Table 6.1.

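Table 6.1 Characteristic relationships or values of different K-moment categories. General equations for any p and q are given in Appendix 6-II.

p, q | Characteristic relationships or values | Eqn. no.
q = 0 | $K'_{p0} = K_{p0} = \bar{K}'_{p0} = K^+_{2p,0} = \mu'_0 = \mu_0 = 1$, $K^+_{2p+1,0} = 0$ | (6.10)
p = q = 1 | $K'_{11} = \bar{K}'_{11} = \mu$, $K_{11} = K^+_{11} = 0$ | (6.11)
q = 1 | $K'_{p1} = K'_p$, $K'_p = K_p + \mu$ | (6.12)
q = 2 | $K_{p2} = K'_{p2} - 2\mu K'_{p-1,1} + \mu^2$, $K'_{p2} = K_{p2} + 2\mu K_{p-1,1} + \mu^2$ | (6.13)
p = q | $K'_{pp} = \bar{K}'_{pp} = \mu'_p$, $K_{pp} = K^+_{pp} = \mu_p$ | (6.14)
p = q + 1 | $\bar{K}'_{q+1,q} = 2K'_{qq} - K'_{q+1,q}$, $K'_{q+1,q} = 2\bar{K}'_{qq} - \bar{K}'_{q+1,q}$, $K^+_{q+1,q} = 2K_{q+1,q} - 2K_{qq}$, $K_{q+1,q} = \tfrac{1}{2}K^+_{q+1,q} + K_{qq}$ | (6.15)
p = q + 2 | $\bar{K}'_{q+2,q} = 3K'_{qq} - 3K'_{q+1,q} + K'_{q+2,q}$, $K'_{q+2,q} = 3\bar{K}'_{qq} - 3\bar{K}'_{q+1,q} + \bar{K}'_{q+2,q}$, $K^+_{q+2,q} = 3K_{qq} - 6K_{q+1,q} + 4K_{q+2,q}$, $K_{q+2,q} = \tfrac{3}{4}K_{qq} + \tfrac{3}{4}K^+_{q+1,q} + \tfrac{1}{4}K^+_{q+2,q}$ | (6.16)
p = q + 3 | $\bar{K}'_{q+3,q} = 4K'_{qq} - 6K'_{q+1,q} + 4K'_{q+2,q} - K'_{q+3,q}$, $K'_{q+3,q} = 4\bar{K}'_{qq} - 6\bar{K}'_{q+1,q} + 4\bar{K}'_{q+2,q} - \bar{K}'_{q+3,q}$, $K^+_{q+3,q} = -4K_{qq} + 12K_{q+1,q} - 16K_{q+2,q} + 8K_{q+3,q}$, $K_{q+3,q} = \tfrac{1}{2}K_{qq} + \tfrac{3}{4}K^+_{q+1,q} + \tfrac{1}{2}K^+_{q+2,q} + \tfrac{1}{8}K^+_{q+3,q}$ | (6.17)


6.2 Theoretical calculation of K-moments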


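Applying the definition (6.6), we can determine the noncentral K-moments from:

\[ K'_{pq} = (p - q + 1)\int_{-\infty}^{\infty} (F(x))^{p-q} x^q f(x)\,\mathrm{d}x \tag{6.18} \]

The calculation may be facilitated if the distribution function is explicitly invertible, through the inverse function $x(F) := F^{-1}(F)$. In this case:

\[ K'_{pq} = (p - q + 1)\int_0^1 (x(F))^q F^{p-q}\,\mathrm{d}F \tag{6.19} \]

or, equivalently,

\[ K'_{pq} = \int_0^1 \left(x\!\left(F^{1/(p-q+1)}\right)\right)^q \mathrm{d}F \tag{6.20} \]

Likewise, we calculate the central K-moments from:

\[ K_{pq} = (p - q + 1)\int_{-\infty}^{\infty} (F(x))^{p-q} (x - \mu)^q f(x)\,\mathrm{d}x \tag{6.21} \]

or

\[ K_{pq} = (p - q + 1)\int_0^1 (x(F) - \mu)^q F^{p-q}\,\mathrm{d}F = \int_0^1 \left(x\!\left(F^{1/(p-q+1)}\right) - \mu\right)^q \mathrm{d}F \tag{6.22} \]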


When analytical calculation is infeasible, numerical calculation of theoretical moments 
involves no difficulty; thus, the existence of an analytical solution of theoretical moments 
of a certain distribution should not be regarded as an important criterion for choosing 
that distribution. The important issue for model fitting is whether the moments are 
knowable or not, in the sense of their estimation from a sample; their theoretical values 
are always knowable once the distribution parameters have been specified. 
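To illustrate how simple the numerical route can be, the sketch below evaluates (6.19) with a plain midpoint rule for the unit exponential distribution, whose quantile function is x(F) = −ln(1 − F), and compares the results with the closed forms $K'_p = H_p$ and $K'_{p2} = (H_{p-1})^2 + H^{(2)}_{p-1}$ (harmonic numbers of first and second order). The function names and the number of integration nodes are our own choices, made for illustration only; a production calculation would probably use an adaptive quadrature.

```python
import math


def k_noncentral(quantile, p, q=1, n=200_000):
    """Midpoint-rule evaluation of equation (6.19):
    K'_pq = (p - q + 1) * integral over (0, 1) of x(F)^q F^(p-q) dF."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h              # midpoint of the i-th cell of (0, 1)
        total += quantile(u) ** q * u ** (p - q)
    return (p - q + 1) * total * h


def exponential_quantile(u, lam=1.0):
    """Quantile function of the exponential distribution with scale lam."""
    return -lam * math.log1p(-u)


def harmonic(n, order=1):
    """Generalized harmonic number: sum of 1 / i^order for i = 1..n."""
    return sum(1.0 / i ** order for i in range(1, n + 1))


if __name__ == "__main__":
    for p in (1, 2, 5, 10):
        print(p, k_noncentral(exponential_quantile, p), harmonic(p))
```

The midpoint rule never evaluates the quantile function at F = 1, where it diverges logarithmically, which is why no special treatment of the upper limit is needed in this simple case.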


6.3 Specific cases of explicit expressions of K-moments 


Many customary distribution functions allow convenient theoretical calculation of the 
tail-based K-moments, while noncentral or central moments may not have an analytical 
expression. In these cases, we can evaluate the noncentral K-moments at a second step, 
exploiting the fact that the sequences of noncentral and tail-based K-moments are related 
through a binomial transform, whose details are contained in Appendix 6-I. As shown in 
Appendix 6-II, these relationships are:

\[ \bar{K}'_{pq} = \sum_{i=1}^{p-q+1} (-1)^{i-1}\binom{p-q+1}{i} K'_{i+q-1,q}, \qquad K'_{pq} = \sum_{i=1}^{p-q+1} (-1)^{i-1}\binom{p-q+1}{i} \bar{K}'_{i+q-1,q} \tag{6.23} \]

and for the particular case q = 1:

\[ \bar{K}'_p = \sum_{i=1}^{p} (-1)^{i-1}\binom{p}{i} K'_i, \qquad K'_p = \sum_{i=1}^{p} (-1)^{i-1}\binom{p}{i} \bar{K}'_i \tag{6.24} \]

This method is especially useful for a class of distribution functions, among those of Table 2.5, which are most useful in the study of extremes of nonnegative variables. This class comprises the PBF distribution and its special and limiting cases, Pareto, Weibull and exponential, for which general analytical formulae for $\bar{K}'_{pq}$ are possible. These are contained in Table 6.2, which also covers the case of a mixed distribution with discontinuity at the origin, with $P_0 := P\{x = 0\} = 1 - P_1$ (where $0 < P_1 \le 1$).

Hypercentral moments can also be determined by analogous relationships based on 
the binomial transform, which are contained in Appendix 6-II. 

We should stress, however, that numerical evaluation of the binomial transform works 
well for p of the order of several tens, but not of hundreds or thousands. The reason is that 
the binomial transform of a sequence for order p is equivalent to differencing the 
sequence p times and it is well known that differencing many times causes numerical 
errors, which may lead to runaway if p is large. Therefore, for large p, analytical 
relationships or other types of numerical approximations are desirable. 
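The numerical behaviour of the binomial transform can be explored with a small experiment. The sketch below applies the q = 1 form (6.24) to the unit exponential distribution, for which the tail-based moments are $\bar{K}'_i = 1/i$ (the expected minimum of i copies) and the noncentral moments are the harmonic numbers $H_p$. It is written so that the same function runs either in exact rational arithmetic (Python's `fractions.Fraction`) or in floating point; the function names are ours and the choice of distribution is only an assumption made for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_transform(seq):
    """Equation (6.24): seq[i-1] holds the moment of order i (i = 1..p);
    the returned value is the transformed moment of order p."""
    p = len(seq)
    return sum((-1) ** (i - 1) * comb(p, i) * seq[i - 1] for i in range(1, p + 1))


def exponential_tail_moments(p, exact=True):
    """Tail-based moments of the unit exponential distribution for orders 1..p."""
    return [Fraction(1, i) if exact else 1.0 / i for i in range(1, p + 1)]


def harmonic(p):
    """Exact harmonic number H_p."""
    return sum(Fraction(1, i) for i in range(1, p + 1))


if __name__ == "__main__":
    for p in (10, 40, 80):
        exact = binomial_transform(exponential_tail_moments(p))
        floating = binomial_transform(exponential_tail_moments(p, exact=False))
        print(p, float(exact), floating)
```

In exact arithmetic the transform reproduces $H_p$ for any p, whereas in double precision the alternating sum involves terms as large as the central binomial coefficients, so one should expect the floating-point result to lose all significant digits once p reaches several tens.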


Table 6.2 Analytical results for the tail-based moments of the PBF distribution and its special cases. The distributions are defined on [0, ∞) with a possible discontinuity at zero, with $P_0 := P\{x = 0\} = 1 - P_1$ ($0 < P_1 \le 1$). All parameters are positive: λ is a scale parameter, ξ is the tail index and ζ is a second shape parameter (lower-tail index).

Name | Tail function, $\bar{F}(x)$ | Tail-based moment, $\bar{K}'_{pq}$ | Eqn. no.
PBF | $P_1\left(1 + \zeta\xi\,(x/\lambda)^{\zeta}\right)^{-1/(\zeta\xi)}$ | $\lambda^q P_1^{p-q+1}\,\dfrac{q}{\zeta}\,(\zeta\xi)^{-q/\zeta}\; B\!\left(\dfrac{p-q+1}{\zeta\xi} - \dfrac{q}{\zeta},\ \dfrac{q}{\zeta}\right)$ | (6.25)
Weibull (ξ = 0) | $P_1 \exp\!\left(-(x/\lambda)^{\zeta}\right)$ | $\lambda^q P_1^{p-q+1}\,(p-q+1)^{-q/\zeta}\;\Gamma\!\left(1 + \dfrac{q}{\zeta}\right)$ | (6.26)
Pareto (ζ = 1) | $P_1\left(1 + \xi x/\lambda\right)^{-1/\xi}$ | $q\,\lambda^q P_1^{p-q+1}\,\xi^{-q}\; B\!\left(\dfrac{p-q+1}{\xi} - q,\ q\right)$ | (6.27)
Exponential (ξ = 0, ζ = 1) | $P_1 \exp(-x/\lambda)$ | $q!\,\lambda^q P_1^{p-q+1}\,(p-q+1)^{-q}$ | (6.28)

Table 6.3 Analytical results for the noncentral K-moments of the Pareto, exponential, Dagum, EV1 and EV2 distributions. The distributions are defined on [0, ∞) with a possible discontinuity at zero, with $P_0 := P\{x = 0\} = 1 - P_1$ ($0 < P_1 \le 1$). An exception is the EV1 distribution, whose support is the entire real line. All parameters are positive: λ is a scale parameter, ξ is the tail index and ζ is a second shape parameter (lower-tail index).


























Case, tail function $\bar{F}(x)$ | $K'_{pq}$ | Eqn. no.

Pareto¹, $P_1 = 1$; $\bar{F}(x) = (1 + \xi x/\lambda)^{-1/\xi}$ | $K'_p = \dfrac{\lambda}{\xi}\left(p\,B(1-\xi, p) - 1\right)$, $K'_1 = \dfrac{\lambda}{1-\xi}$; $K'_{p2} = \left(\dfrac{\lambda}{\xi}\right)^{2}\left((p-1)\left(B(1-2\xi, p-1) - 2B(1-\xi, p-1)\right) + 1\right)$, $K'_{22} = \dfrac{2\lambda^2}{(1-\xi)(1-2\xi)}$ | (6.29)

Pareto with discontinuity, $P_1 < 1$; $\bar{F}(x) = P_1(1 + \xi x/\lambda)^{-1/\xi}$ | $K'_p = \dfrac{\lambda}{\xi}\left(p\,P_1^{\xi} B_{P_1}(1-\xi, p) - 1 + (1-P_1)^p\right)$, $K'_1 = \dfrac{\lambda P_1}{1-\xi}$; $K'_{p2} = \left(\dfrac{\lambda}{\xi}\right)^{2}\left((p-1)P_1^{\xi}\left(P_1^{\xi} B_{P_1}(1-2\xi, p-1) - 2B_{P_1}(1-\xi, p-1)\right) - (1-P_1)^{p-1} + 1\right)$, $K'_{22} = \dfrac{2\lambda^2 P_1}{(1-\xi)(1-2\xi)}$ | (6.30)

Exponential¹, $P_1 = 1$; $\bar{F}(x) = \exp(-x/\lambda)$ | $K'_p = \lambda H_p$, $K'_1 = \lambda$, $K'_2 = \dfrac{3\lambda}{2}$; $K'_{p2} = \left((H_{p-1})^2 + H^{(2)}_{p-1}\right)\lambda^2$, $K'_{22} = 2\lambda^2$ | (6.31)

Exponential with discontinuity, $P_1 < 1$; $\bar{F}(x) = P_1\exp(-x/\lambda)$ | $K'_p = \lambda\,p\,P_1\;{}_3F_2(1, 1, 1-p;\ 2, 2;\ P_1)$²; $K'_{p2} = 2\lambda^2 (p-1) P_1\;{}_4F_3(1, 1, 1, 2-p;\ 2, 2, 2;\ P_1)$; $K'_1 = \lambda P_1$, $K'_2 = \lambda P_1(2 - P_1/2)$, $K'_{22} = 2\lambda^2 P_1$ | (6.32)

Dagum, $P_1 = 1$; $\bar{F}(x) = 1 - \left(1 + \dfrac{\xi}{\zeta}\left(\dfrac{x}{\lambda}\right)^{-1/\xi}\right)^{-\zeta}$ | $K'_p = \lambda\,(\xi/\zeta)^{\xi}\,p\zeta\;B(1-\xi,\ p\zeta + \xi)$; $K'_{pq} = \lambda^q (\xi/\zeta)^{q\xi}\,(p-q+1)\zeta\;B\!\left(1 - q\xi,\ (p-q+1)\zeta + q\xi\right)$ | (6.33)

EV2³; $\bar{F}(x) = 1 - \exp\!\left(-\xi\left(\dfrac{x}{\lambda}\right)^{-1/\xi}\right)$ | $K'_p = \lambda\,(\xi p)^{\xi}\,\Gamma(1-\xi)$; $K'_{pq} = \lambda^q\left(\xi(p-q+1)\right)^{q\xi}\Gamma(1 - q\xi)$ | (6.34)

EV1⁴; $\bar{F}(x) = 1 - \exp\!\left(-e^{-x/\lambda}\right)$ | $K'_p = \lambda(\ln p + \gamma)$; $K'_{p2} = \lambda^2\left((\ln(p-1) + \gamma)^2 + \dfrac{\pi^2}{6}\right)$ | (6.35)

¹ $K'_{p2}$ of the exponential and the Pareto case can serve as a basis for the calculation of the $K'_{p1}$ of the special case ζ = 1/2 of the Weibull and PBF distributions, respectively.
² ${}_pF_q$ is the generalized hypergeometric function.
³ The EV2 distribution is derived from the Dagum distribution for ζ → ∞.
⁴ The EV1 distribution is derived from the EV2 distribution by replacing x with x − 1/ξ, and letting ξ → 0.

For small values of q, the exponential and Pareto distributions admit explicit expressions also for the noncentral (and central) K-moments, which become simple if $P_1 = 1$. These are contained in Table 6.3. Another distribution, which admits analytical expressions of noncentral K-moments for $P_1 = 1$, is the Dagum distribution and its special cases, the Extreme Value type I and II distributions. These are also contained in Table 6.3. Note that the parameterization of the mathematical expressions of the distributions in Table 6.2 and Table 6.3 may differ from those in Table 2.5 as here we have given emphasis on displaying the interrelationships amongst distributions (e.g., the derivation of the exponential distribution as the limit of the Pareto distribution for ξ → 0).

It is noted that the mixed distribution with a discontinuity at the lower bound (typically x = 0) is useful to examine, and is thus included in Table 6.2 and Table 6.3, for two reasons. First, there are natural processes, such as rainfall and occasionally streamflow, in which the probability dry ($P_0 := P\{x = 0\} = 1 - P_1$) is nonzero (and thus $0 < P_1 < 1$). Second, in several analyses we are interested in values of x above a threshold $x_0$ and in this case the distribution becomes discontinuous at $x_0$.

If the discontinuity at the origin is quantified by $P_1 := P\{x > 0\}$ and we denote the K-moments of the discontinuous and continuous distribution with and without a superscript ‘*’, respectively, then it is easy to see that the tail-based K-moments in the two cases are related by:

\[ \bar{K}'^{*}_p = P_1^{p}\,\bar{K}'_p \tag{6.36} \]

Based on this, we find in Appendix 6-III that the noncentral K-moments in the two cases are related through a Bernoulli transform, whose properties are given in Appendix 6-I. Namely, $K'^{*}_p$ is the Bernoulli transform of $K'_p$ with parameter $P_1$:

\[ K'^{*}_p = \sum_{i=1}^{p} \binom{p}{i} K'_i\, P_1^{i} (1 - P_1)^{p-i} = (1 - P_1)^p \sum_{i=1}^{p} \binom{p}{i} K'_i \left(\frac{P_1}{1 - P_1}\right)^{i} \tag{6.37} \]





Unlike the binomial transform, the Bernoulli transform does not entail numerical 
problems as it reflects summation of positive quantities rather than differences thereof. 
Yet a simple approximation is proposed, which can be useful in some cases: 


2 
P, Kip? <— are 
ee ic iP Pp P, se _ In(K3/Kj P,) eae 
Ky = 5» Pp =Pip, FOS (6.38) 
K! ta In(2/P,) 
p' P P, 


The rationale of this approximation is given in Appendix 6-III. 
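The Bernoulli transform (6.37) itself is straightforward to evaluate. As a hedged illustration, the sketch below applies it to the exponential distribution with scale λ, for which the continuous-case moments are $K'_i = \lambda H_i$, and checks the result against the closed forms $K'^{*}_1 = \lambda P_1$ and $K'^{*}_2 = \lambda P_1 (2 - P_1/2)$ that follow for the exponential distribution with discontinuity. The function names and the value $P_1 = 0.3$ are assumptions made only for this example.

```python
from math import comb


def bernoulli_transform(k_continuous, p1):
    """Equation (6.37): k_continuous[i-1] is K'_i of the continuous distribution
    (i = 1..p); returns K'*_p of the mixed distribution with P{x > 0} = p1."""
    p = len(k_continuous)
    return sum(comb(p, i) * k_continuous[i - 1] * p1 ** i * (1.0 - p1) ** (p - i)
               for i in range(1, p + 1))


def harmonic(n):
    """Harmonic number H_n."""
    return sum(1.0 / i for i in range(1, n + 1))


if __name__ == "__main__":
    lam, p1 = 1.0, 0.3   # hypothetical scale parameter and probability wet
    for p in (1, 2, 5, 20):
        k_cont = [lam * harmonic(i) for i in range(1, p + 1)]
        print(p, bernoulli_transform(k_cont, p1))
```

All terms in the sum are nonnegative for a nonnegative variable, so no cancellation occurs, in contrast to the alternating sums of the binomial transform.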

While the PBF and the Dagum distributions, as well as their special and limiting cases 
discussed in this section, are quite general and convenient in their use, other customary 
distributions such as normal, lognormal and gamma, are in common use for several 
hydrometeorological processes. These distributions do not admit analytical relationships 
of K-moments, but good approximations are discussed in section 6.14. 




6.4 Relationship of knowable and classical moments 
The classical moments can be recovered as special cases of K-moments:

\[ K'_{pp} = \mu'_p, \qquad K_{pp} = \mu_p \tag{6.39} \]


While the classical moments are extremely useful as theoretical concepts, and their values 
can be derived rather easily if the distribution function is specified, their estimation from 
samples is problematic if we go to moment order higher than 2 to 3. This has been stressed 
in the title of the article by Lombardo et al. (2014): “Just two moments”, while for the same 
reason in Koutsoyiannis (2019a) classical moments beyond that order have been termed 
unknowable (see Digression 4.B). 

As we will see in section 6.9, the contrary happens with K-moments. They can be 
estimated reliably even for a very high order p (hence their name knowable), provided 
that the order qg remains low. 

Not only can K-moments be estimated for high orders p, but they can also predict the value of the estimates of the classical moments. Next, we will derive this prediction for the noncentral classical moments in detail. We note that this prediction does not coincide with the true value of the classical moment. This may sound paradoxical as it is known that, for order p however large, $\hat{\mu}'_p$ is an unbiased estimator of $\mu'_p$. In practice, however, the convergence of $\hat{\mu}'_p$ to $\mu'_p$ is very slow, while the K-moments can give us an indication of what we can anticipate for the value of $\hat{\mu}'_p$, which does not coincide with $\mu'_p$. In this respect, by examining the moment estimators, we will establish relationships between K- and classical moments broader and more essential than (6.39).

For large p the classical moment estimator will give:

\[ \hat{\mu}'_p = \frac{1}{n}\sum_{i=1}^{n} x_i^p \approx \frac{1}{n}\left(\max_{1 \le i \le n} x_i\right)^{p} = \frac{1}{n}\,x_{(n)}^p \tag{6.40} \]

This is related to the well-known mathematical fact that the maximum norm is the limit of the p-norm as p → ∞, as explained in Digression 4.B. Taking expected values in (6.40), we find:

\[ \mathrm{E}[\hat{\mu}'_p] \approx \frac{1}{n}\,\mathrm{E}\!\left[x_{(n)}^p\right] = \frac{K'_{n+p-1,p}}{n} \tag{6.41} \]

and since $\mathrm{E}[\hat{\mu}'_p] = \mu'_p$, for large p:

\[ K'_{n+p-1,p} \approx n\,\mu'_p \tag{6.42} \]

or, equivalently, for large q and for n = p − q + 1:

\[ K'_{pq} \approx (p - q + 1)\,\mu'_q \tag{6.43} \]

from which for p = q, we recover (6.39), in this case holding precisely.

However, if $\hat{\mu}'_p$ is estimated from a sample and p is large, we do not anticipate that $\hat{\mu}'_p$ would be close to the true classical moment $\mu'_p$. Rather, because of (6.40), we can anticipate that:

\[ \hat{\mu}'_p \approx \frac{x_{(n)}^p}{n} \approx \frac{(K'_{n1})^p}{n} \tag{6.44} \]
Likewise, if the classical moment is estimated as the average of m different (and independent) samples, each of size n, then:

\[ \hat{\mu}'_p = \frac{1}{m}\sum_{j=1}^{m} \hat{\mu}'_{p,j} \approx \frac{1}{mn}\sum_{j=1}^{m} x_{(n),j}^p \approx \frac{1}{mn}\,x_{(mn)}^p \tag{6.45} \]

and thus, we can anticipate that our so produced estimate will be:

\[ \hat{\mu}'_p \approx \frac{x_{(mn)}^p}{mn} \approx \frac{(K'_{mn,1})^p}{mn} \tag{6.46} \]

To establish a more general formula that will give us the anticipated value for our estimate $\hat{\mu}'_p$, applicable for small and large p, we observe that, because of (6.41), the ratio $mn\,\mu'_p / K'_{mn+p-1,p}$ is 1 for large p, and thus multiplying the rightmost part of (6.46) by it will not have any effect if p is large, while having the desired effect if p is small and particularly if p = 1. Thus, our formula becomes

\[ \hat{\mu}'_p \approx \frac{(K'_{mn,1})^p}{K'_{mn+p-1,p}}\,\mu'_p \tag{6.47} \]

and is valid either for the average estimate from many samples (m > 1) or for just one sample of size n (m = 1). Indeed, for p = 1 and for any m and n, our formula (6.47) yields:

\[ \hat{\mu}'_1 \approx \frac{K'_{mn,1}}{K'_{mn,1}}\,\mu'_1 = \mu'_1 \tag{6.48} \]

while for large p we recover (6.46) as already explained.
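To give a feeling for the magnitudes implied by (6.47), the following sketch evaluates its right-hand side for the unit exponential distribution, for which $\mu'_p = p!$ and $K'_{n1} = H_n$, while $K'_{n+p-1,p} = \mathrm{E}[x_{(n)}^p]$ is obtained by numerical integration of the density of the maximum. The function names, the integration limits and the restriction to moderate orders (roughly p ≤ 40 with the default upper limit) are our own assumptions for this illustration.

```python
import math


def harmonic(n):
    """Harmonic number H_n, equal to K'_n1 of the unit exponential distribution."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_max_power(n, p, t_max=100.0, steps=100_000):
    """K'_{n+p-1,p} = E[(max of n unit-exponential variables)^p], computed by the
    trapezoidal rule on n * t^p * (1 - exp(-t))^(n-1) * exp(-t) over [0, t_max].
    The integrand is zero at t = 0 (p >= 1) and negligible at t_max for p <= 40."""
    h = t_max / steps
    total = 0.0
    for i in range(1, steps):
        t = i * h
        total += t ** p * (-math.expm1(-t)) ** (n - 1) * math.exp(-t)
    return n * total * h


def anticipated_estimate(p, n, m=1):
    """Right-hand side of equation (6.47) for the unit exponential distribution."""
    mn = m * n
    return harmonic(mn) ** p * math.factorial(p) / expected_max_power(mn, p)


if __name__ == "__main__":
    n = 100
    for p in (1, 2, 5, 10, 20):
        print(p, math.factorial(p), anticipated_estimate(p, n),
              anticipated_estimate(p, n, m=1000))
```

For p = 1 the anticipated value equals the true mean, in agreement with (6.48), while for p = 20 and n = 100 it falls short of the true moment 20! by several orders of magnitude; averaging over m = 1000 samples narrows the gap but does not close it.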


Digression 6.A: Example on the relationship of knowable and classical moments 


We illustrate the relationship of K- and classical moments using synthetic samples with size n = 100 from the exponential distribution with lower bound zero and scale parameter 1. This distribution has simple expressions of its moments, i.e.,

\[ \mu'_p = p!, \qquad K'_{p1} = H_p \tag{6.49} \]

(see Table 6.2 and Table 6.3), where $H_p$ is the pth harmonic number. For q > 2 the K-moment $K'_{pq}$ does not have a closed analytical expression but its calculation can be easily made by numerical integration.

Figure 6.1 (left panel) shows a comparison of the theoretical (true) moments, of orders 1 to 100, of the distribution with the empirical estimations from a single sample as well as from a set of 1000 samples (by averaging 1000 estimates). The single sample estimates deviate from (are lower than) the true moments for p ≥ 3 and the deviation becomes one order of magnitude as p approaches n = 100. We will refer to this deviation as slow convergence bias, because theoretically speaking there is no bias per se, according to the bias definition in section 4.3. From a practical point of view, the moment estimates are not the same thing as the true moments, despite the theoretical guarantee that the estimates are unbiased. Even the average of 1000 estimates from different synthetic samples deviates substantially from the true moments for order p > 10. On the other hand, equation (6.47) captures very well the behaviour of the estimates for both m = 1 and m = 1000.




Figure 6.1 (right panel) provides additional information on the same exercise and in particular on the terms appearing in equation (6.47) for the case of a single sample. First it shows that the true and estimated $(K'_{n1})^p$ agree, which indicates that even for p = n the estimate of $K'_{p1}$ is close to the true value (where the estimator will be discussed later, in section 6.9). Second, the graphs show the variation of the term $K'_{n+p-1,p}$, which appears in the denominator of (6.47). For small p, $K'_{n+p-1,p} \approx (K'_{n1})^p$, while for large p, $K'_{n+p-1,p} \approx \mu'_p$. As a consequence, by virtue of (6.47), for large p, $\hat{\mu}'_p \approx (K'_{n1})^p$, or $(\hat{\mu}'_p)^{1/p} \approx K'_{n1}$. This is seemingly different from equation (6.44), which gives $(\hat{\mu}'_p)^{1/p} \approx K'_{n1}/n^{1/p}$; however, for large p (approaching n = 100), $n^{1/p} \approx 1$ and therefore the two equations become consistent.
Figure 6.1 (left) Comparison of the estimates of classical noncentral moments from 1 and 1000 independent samples from the exponential distribution to (a) the true (theoretical) moments and (b) the values determined by equation (6.47) (adapted theoretical). (right) Additional information on the terms appearing in equation (6.47).


As a general conclusion, the classical moment estimators do not practically estimate classical 
moments but hybrid quantities involving both classical and K-moments, as implied by equation 
(6.47). The question arises then, when high-order moments are of interest, whether it is useful to 
involve classical moment estimates in statistical calculations, now knowing, from equation (6.47), 
what they really represent or it is better to use merely K-moments. To answer this question, we 
need to examine not the averages of estimates but the ranges in which they vary. 

This information is provided by Figure 6.2, which depicts moments from 100 simulated samples with length n = 2000 from the lognormal distribution LN(0,1). K-moments, specifically $K'_{p1}$ in the left panel and $K'_{p2}$ in the right panel, are compared to classical moments of order p′. To facilitate visual comparison, the order p′ of the classical moments was determined so that their estimates coincide with those of $K'_{p1}$ and $K'_{p2}$ in the left and right panel, respectively. This entails a specific relationship between p′ and p, which was determined numerically and is shown in the figure caption.

A first observation in Figure 6.2 is that true and estimated K-moments coincide, while estimates of classical moments differ substantially from the true values. A second observation is that the 95% prediction limits, determined from the simulation, are much wider in the classical moments than in the K-moments. The two prediction intervals approach each other only when p approaches n = 2000 (note though that the value of p′ for the classical moments when p = 2000 is much less than 2000, close to 160). All in all, the plots show that there is no benefit in using high order classical moments as K-moments are much more reliably estimated.

The breadths of the prediction intervals are depicted in Figure 6.3 in terms of the ratios of maximum to minimum prediction limit. Even without considering the slow convergence bias, which, as illustrated in Figure 6.2, is very high for the classical moments, the broad prediction intervals disfavour again the use of the classical moments. We observe in Figure 6.3 that for the third classical moment, the prediction limits have a ratio of 1.77 (higher to lower, without considering the slow convergence bias). The same ratio appears in K-moments at a moment of order as high as 250 (for q = 1) or 100 (for q = 2). Therefore, instead of estimating the third classical moment, it is safer to estimate and use K-moments of much higher order and specifically up to p = n/10 for $K'_{p1}$ and up to p = n/20 for $K'_{p2}$.

Figure 6.2 Comparison of K- and classical moment estimates for the lognormal distribution LN(0,1) from 100 simulations, each with n = 2000. Order q of K-moments is q = 1 for the left panel and q = 2 for the right panel. The order p′ of the classical moments (see explanation in text) is determined from p′ = q + 220 ln(1 + (p3 − q3) / (500 (q + 1))). PL stands for prediction limits.

Figure 6.3 Ratios of maximum to minimum prediction limits, as a function of moment order p, for the simulation experiment of Figure 6.2.


6.5 Relationship of K-moments with order statistics and maxima 


From equation (6.3) it is evident that the noncentral K-moment of orders (p, 1) is none other than the expected value of the largest order statistic of a sample of x of size p. More generally, excepting tail-based K-moments, which represent expected values of minima, K-moments of all other categories represent expected values of maxima. To see this, we first consider the power transformation $z' := x^q$ and further assume that it is monotonic, which is the case if q is an odd number or if x is nonnegative. Then, it is readily seen that $F_{z'}(z') = P\{x^q \le z'\} = P\{x \le z'^{1/q}\} = F(z'^{1/q})$. Consequently,

\[ \begin{aligned} K'_{pq} &= (p - q + 1)\,\mathrm{E}\!\left[(F(x))^{p-q} x^q\right] = (p - q + 1)\,\mathrm{E}\!\left[\left(F(z'^{1/q})\right)^{p-q} z'\right] \\ &= (p - q + 1)\,\mathrm{E}\!\left[(F_{z'}(z'))^{p-q} z'\right] = K'_{p-q+1,1}[z'] \\ &= \mathrm{E}\!\left[\max\!\left(x_1^q, x_2^q, \ldots, x_{p-q+1}^q\right)\right] = \mathrm{E}\!\left[x_{(p-q+1)}^q\right] \end{aligned} \tag{6.50} \]

This means that the noncentral K-moment $K'_{pq}$ of x is identical to the expected maximum of order p − q + 1 of $z' = x^q$.

Furthermore, we consider the transformation $z := (x - \mu)^q$ assuming that q is an odd number, so that it be monotonic and thus $F_z(z) = P\{(x - \mu)^q \le z\} = P\{x \le z^{1/q} + \mu\} = F(z^{1/q} + \mu)$. Consequently,

\[ \begin{aligned} K_{pq} &= (p - q + 1)\,\mathrm{E}\!\left[(F(x))^{p-q} (x - \mu)^q\right] = (p - q + 1)\,\mathrm{E}\!\left[\left(F(z^{1/q} + \mu)\right)^{p-q} z\right] \\ &= (p - q + 1)\,\mathrm{E}\!\left[(F_z(z))^{p-q} z\right] = K'_{p-q+1,1}[z] \\ &= \mathrm{E}\!\left[\max\!\left((x_1 - \mu)^q, (x_2 - \mu)^q, \ldots, (x_{p-q+1} - \mu)^q\right)\right] \\ &= \mathrm{E}\!\left[\left(\max(x_1, \ldots, x_{p-q+1}) - \mu\right)^q\right] \end{aligned} \tag{6.51} \]

which means that the central K-moment $K_{pq}$ of x is identical to the expected maximum of order p − q + 1 of $z = (x - \mu)^q$.

The above properties should also hold asymptotically, for large p, even if the transformations z or z′ are not monotonic (e.g. for even q) in any case of positively skewed distribution. Interestingly, for a symmetric distribution, a property analogous to (6.51) holds for the hypercentral moments with even q. Indeed, in this case and for z ≥ 0, $F_z(z) = P\{(x - \mu)^q \le z\} = P\{-z^{1/q} + \mu \le x \le z^{1/q} + \mu\} = F(z^{1/q} + \mu) - F(-z^{1/q} + \mu)$ and due to symmetry $F_z(z) = 2F(z^{1/q} + \mu) - 1$. Thus,

\[ \begin{aligned} K^+_{pq} &= (p - q + 1)\,\mathrm{E}\!\left[(2F(x) - 1)^{p-q} (x - \mu)^q\right] = (p - q + 1)\,\mathrm{E}\!\left[\left(2F(z^{1/q} + \mu) - 1\right)^{p-q} z\right] \\ &= (p - q + 1)\,\mathrm{E}\!\left[(F_z(z))^{p-q} z\right] = K'_{p-q+1,1}[z] \\ &= \mathrm{E}\!\left[\max\!\left((x_1 - \mu)^q, (x_2 - \mu)^q, \ldots, (x_{p-q+1} - \mu)^q\right)\right] \end{aligned} \tag{6.52} \]

which means that the hypercentral K-moment $K^+_{pq}$ of a stochastic variable x with symmetrical distribution for q even is identical to the expected maximum of order p − q + 1 of z. In contrast, for q odd, the hypercentral K-moment $K^+_{pq}$ will obviously be zero.

Now coming to the tail-based moments, assuming that the power transformation $z' = x^q$ is monotonic (i.e., q is an odd number or x is nonnegative), we will have $\bar{F}_{z'}(z') = P\{x^q > z'\} = P\{x > z'^{1/q}\} = \bar{F}(z'^{1/q})$; hence:

\[ \begin{aligned} \bar{K}'_{pq} &= (p - q + 1)\,\mathrm{E}\!\left[(\bar{F}(x))^{p-q} x^q\right] = (p - q + 1)\,\mathrm{E}\!\left[\left(\bar{F}(z'^{1/q})\right)^{p-q} z'\right] \\ &= (p - q + 1)\,\mathrm{E}\!\left[(\bar{F}_{z'}(z'))^{p-q} z'\right] = \bar{K}'_{p-q+1,1}[z'] \\ &= \mathrm{E}\!\left[\min\!\left(x_1^q, x_2^q, \ldots, x_{p-q+1}^q\right)\right] \end{aligned} \tag{6.53} \]
In section 6.9 we will discuss a further connection of K-moments and order statistics 
which enables estimation of K-moments from an ordered sample. 
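The interpretation of K-moments as expected maxima (or, for the tail-based variant, minima) of p − q + 1 copies can be checked by simulation. The sketch below does so for the unit exponential distribution, for which the expected squared maximum of k copies is $(H_k)^2 + H^{(2)}_k$ and the expected q-th power of the minimum of k copies is $q!/k^q$. The function names, sample size and seed are arbitrary choices of ours, made only for illustration.

```python
import random


def simulate_extreme_power(p, q, n_sim=100_000, seed=1, use_min=False):
    """Monte Carlo estimate of E[max(x_1^q, ..., x_k^q)] (or of the minimum),
    with k = p - q + 1 and x unit-exponential, as in (6.50) and (6.53)."""
    rng = random.Random(seed)
    k = p - q + 1
    pick = min if use_min else max
    total = 0.0
    for _ in range(n_sim):
        # For nonnegative x the power transformation is monotonic, so the
        # extreme of the powers equals the power of the extreme.
        total += pick(rng.expovariate(1.0) for _ in range(k)) ** q
    return total / n_sim


def harmonic(n, order=1):
    """Generalized harmonic number: sum of 1 / i^order for i = 1..n."""
    return sum(1.0 / i ** order for i in range(1, n + 1))


if __name__ == "__main__":
    p, q = 5, 2
    k = p - q + 1
    print("noncentral:", simulate_extreme_power(p, q),
          "theory:", harmonic(k) ** 2 + harmonic(k, 2))
    print("tail-based:", simulate_extreme_power(p, q, use_min=True),
          "theory:", 2.0 / k ** 2)
```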


6.6 Relationship of K-moments with L-moments and probability weighted 
moments 


L-moments (Hosking et al., 1985a,b; Hosking, 1990) represent a very useful and popular moment category as, contrary to classical moments, they have unbiased estimators for high orders. According to their definition, L-moments are linear combinations of order statistics. Naturally then, L-moments are connected by linear relationships with K-moments for the specific case of q = 1. The relationships for the first four orders p are given in Koutsoyiannis (2019a).

L-moments are also related to probability weighted moments (PWM), which had been introduced earlier than the former (Greenwood et al., 1979). Actually, the latter are more directly defined than the former and therefore are preferable; in fact, the estimation of L-moments is made from that of PWM. In particular, the definition of the PWM involves three different orders, p, s and q, and is $\beta_{p,s,q} := \mathrm{E}\!\left[(F(x))^p (1 - F(x))^s x^q\right]$. However, only the case q = 1 has been studied and the most common form, which is also used for estimation of L-moments, is $\beta_p := \beta_{p,0,1} = \mathrm{E}\!\left[(F(x))^p x\right]$. This is proportional to the noncentral K-moment for q = 1:

\[ K'_p = p\,\beta_{p-1} \tag{6.54} \]


The small difference (multiplication by p) makes the K-moments more intuitive. First, it 
makes K-moments increasing functions of p, which is consistent to the behaviour of 
classical moments. Also, it links them directly to the largest order statistic as discussed 
above (see also Koutsoyiannis, 2019a). 


6.7 Summary statistical characteristics based on K-moments 


It is useful to note that the central moment $K_{pq}$ of x is identical to the noncentral K-moment of y := x − μ and, in particular, for q = 1, $K_p$ is the expected maximum of order p of y = x − μ. Observing that the distribution function of y is $F_y(y) = F(y + \mu)$ and the density is $f_y(y) = f(y + \mu)$, we can easily verify the above property, which we can denote by the following notation:

\[ K_{pq}[x] = K'_{pq}[x - \mu] \tag{6.55} \]

Useful properties of central K-moments are: (a) their invariance to a shift of origin and (b) their homogeneity in terms of multiplication by a scalar (see Appendix 6-IX). Both these properties are reflected in the relationship:

\[ K_{pq}[a + bx] = b^q K_{pq}[x] \tag{6.56} \]

It is thus evident that for any p, p′, q, the ratio

\[ \frac{K_{pq}[a + bx]}{K_{p'q}[a + bx]} = \frac{K_{pq}[x]}{K_{p'q}[x]} \tag{6.57} \]

does not depend on a and b. In other words, this ratio is invariant under linear transformation of the variable. This makes meaningful the standardization of the variable x either by $K_2$ if q = 1 or by $K_{22}^{1/2}$ if q = 2. The standardized variables:

\[ w_1 := \frac{x - \mu}{K_2[x]}, \qquad w_2 := \frac{x - \mu}{(K_{22}[x])^{1/2}} \tag{6.58} \]

will have $K_2[w_1] = K_{22}[w_2] = 1$ and thus

\[ K_p[w_1] = \frac{K_p[x]}{K_2[x]}, \qquad K_{p2}[w_2] = \frac{K_{p2}[x]}{K_{22}[x]} \tag{6.59} \]

Within the framework of K-moments and according to the rule of thumb “Just two [classical] moments” we may assume that the power of x, i.e. q, should be taken q = 1 or 2, while p can be however large. In this manner, for p > 1 we have two alternative options to define statistical characteristics related to moments of the distribution. As discussed in Digression 2.E, the most customary characteristics with respect to classical moments are those for orders p from 1 to 4, which indicate the location, variability, skewness and kurtosis of a distribution. Similar is the meaning if we replace classical moments with K-moments. These characteristics are shown in Table 6.4 for both moment categories and for three options.
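The shift invariance and homogeneity of central K-moments can be verified numerically. The sketch below evaluates the central K-moments through the quantile-based integral of section 6.2 (midpoint rule) for a unit-exponential variable x and for a linearly transformed variable a + bx with b > 0, and then compares the two according to the relationship $K_{pq}[a + bx] = b^q K_{pq}[x]$. The constants A and B, the function names and the number of nodes are hypothetical choices made only for this illustration.

```python
import math


def central_k(quantile, mean, p, q=1, n=100_000):
    """Midpoint-rule evaluation of the central K-moment
    K_pq = (p - q + 1) * integral over (0, 1) of (x(F) - mean)^q F^(p-q) dF."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += (quantile(u) - mean) ** q * u ** (p - q)
    return (p - q + 1) * total * h


def x_quantile(u):
    """Quantile function of the unit exponential distribution (mean 1)."""
    return -math.log1p(-u)


A, B = 3.0, 2.5  # hypothetical shift and positive scale of the transformation


def y_quantile(u):
    """Quantile function of y = A + B x (mean A + B); valid because B > 0."""
    return A + B * x_quantile(u)


if __name__ == "__main__":
    for p, q in ((5, 1), (5, 2), (20, 2)):
        kx = central_k(x_quantile, 1.0, p, q)
        ky = central_k(y_quantile, A + B, p, q)
        print(p, q, ky, B ** q * kx)
```

Because the same factor $b^q$ appears for every p, ratios of central K-moments of equal q are unaffected by the transformation, which is what makes the standardized quantities meaningful.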


Table 6.4 Typical marginal statistical characteristics of distributions using different moment categories. Characteristics based on classical moments are also given as Option 3, while in Option 1 those based on L-moments are shown, with $\lambda_p$ denoting the L-moment of order p.

Characteristic | Order p | Option 1 | Option 2 | Option 3
Location | 1 | $K'_{11} = \mu$ (the classical mean), for all options
Variability | 2 | $K^+_{21} = 2K_{21} = 2(K'_2 - \mu) = 2\lambda_2$ | $K^+_{22} = K_{22} = \mu_2 = \sigma^2$ (the classical variance), for Options 2 and 3
Skewness (dimensionless) | 3 | $\dfrac{K^+_{31}}{K^+_{21}} = 2\dfrac{K_{31}}{K_{21}} - 3 = \dfrac{\lambda_3}{\lambda_2}$ | $\dfrac{K^+_{32}}{K^+_{22}} = 2\dfrac{K_{32}}{K_{22}} - 2$ | $\dfrac{\mu_3}{\sigma^3}$
Kurtosis (dimensionless) | 4 | $\dfrac{K^+_{41}}{K^+_{21}} = 4\dfrac{K_{41}}{K_{21}} - 8\dfrac{K_{31}}{K_{21}} + 6 = \dfrac{4}{5}\dfrac{\lambda_4}{\lambda_2} + \dfrac{6}{5}$ | $\dfrac{K^+_{42}}{K^+_{22}} = 4\dfrac{K_{42}}{K_{22}} - 6\dfrac{K_{32}}{K_{22}} + 3$ | $\dfrac{\mu_4}{\sigma^4}$

Options 1 and 2 correspond to K-moments for q = 1 and 2, respectively, while Option
3 corresponds to the classical moments. For the definitions of the characteristics of 
Options 1 and 2, the hypercentral variant of K-moments is used as it is more suitable for 
this purpose (see Koutsoyiannis, 2019a). The summary characteristics of Table 6.4 are 
useful quantified indications of the general behaviour of a stochastic variable and 
occasionally can also be used for the fitting of a distribution. However, as we will see in 
section 6.15, other fitting methods involving higher order K-moments may be preferable 
if we are particularly interested in extremes. 
By inspecting the formulae of skewness and kurtosis for Options 1 and 2, we can 
observe that all can be described by the single equation: 
K K,_ 
Keg ae ott 
with q = 1 and 2 for Options 1 and 2, respectively, and p = 3 and 4 for skewness and 
kurtosis, respectively. Actually, for the kurtosis equation, (6.60) is slightly different from 
those in Table 6.4 (each one is a linear transformation of the other in each pair) but the 
generality and simplicity (avoiding use of the more complicated hypercentral moments) 
are advantages. 


(6.60) 


6.8 The K-moments as the Laplace transform of the quantile function 


The K-moment definition in general implies that the order q should be an integer in order
to ensure that $x^q$ is a real number. Also, so far, the order p was assumed to be an integer.
However, if we specify q = 1 we can generalize that p can be a real number. Then we can
write the K-moment as a function of the real variable p. Adapting (6.19) we write:

$$K'(p) = K'_p = p \int_0^1 x(F)\, F^{p-1}\, \mathrm{d}F \tag{6.61}$$

Now, given that F takes on values in the interval [0, 1] and in order to obtain a variable
spanning all nonnegative real numbers, we use the transformations:

$$F = e^{-s}, \qquad g(s) = x(e^{-s}), \qquad G(p) = \frac{K'(p)}{p} \tag{6.62}$$

Then we can write:

$$G(p) = \int_0^\infty g(s)\, e^{-ps}\, \mathrm{d}s = \mathcal{L}\{g\}(p) \tag{6.63}$$

where $\mathcal{L}\{g\}(p)$ is the Laplace transform of g(s). This can help in determining F by
inverting the transform:

$$g(s) = \mathcal{L}^{-1}\{G\}(s), \qquad x(F) = g(-\ln F), \qquad F(x) = e^{-g^{-1}(x)} \tag{6.64}$$
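As a rough numerical check of the relationship $G(p) = K'(p)/p = \mathcal{L}\{g\}(p)$, the sketch below evaluates the Laplace integral for the unit exponential distribution, whose quantile function is x(F) = −ln(1 − F) and whose noncentral K-moment for integer p equals the harmonic number $H_p$ (the expected maximum of p independent exponential variables). The function names, the midpoint rule and the truncation of the integral at `s_max` are illustrative assumptions, not part of the original text.

```python
import math


def laplace_of_quantile(p, n_steps=100_000, s_max=40.0):
    """Midpoint-rule approximation of G(p) = int_0^inf g(s) exp(-p s) ds
    for the unit exponential distribution, with g(s) = x(exp(-s)) = -ln(1 - exp(-s))."""
    h = s_max / n_steps
    total = 0.0
    for k in range(n_steps):
        s = (k + 0.5) * h
        g = -math.log1p(-math.exp(-s))  # quantile function evaluated at F = exp(-s)
        total += g * math.exp(-p * s)
    return total * h


def harmonic(p):
    """Harmonic number H_p = 1 + 1/2 + ... + 1/p (K'_p of the unit exponential)."""
    return sum(1.0 / k for k in range(1, p + 1))


if __name__ == "__main__":
    for p in (1, 2, 5):
        # p * G(p) should reproduce K'_p = H_p up to the integration error
        print(p, p * laplace_of_quantile(p), harmonic(p))
```

The integrand has an integrable logarithmic singularity at s = 0, which is why a midpoint rule (never evaluating at s = 0) is used here; a production implementation would more likely rely on an adaptive quadrature routine.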
6.9 Unbiased estimators of noncentral and tail Knowable moments

The quantities $(F(x))^{p-q} x^q$, whose expectations define K-moments, are estimated from
a sample without using powers of x higher than q (which could be conveniently assumed
to be 1 or 2), thus making the estimation reliable. The literature provides unbiased
estimators of probability weighted moments and L-moments, both corresponding to q =
1, based on the analysis by Landwehr et al. (1979). Here we adapt those results for the
K-moments.

As a general remark, the estimation of K-moments relies on that of quantities
$(F(x))^{p-q}$. This becomes possible if we involve order statistics, i.e., if we arrange the
sample in ascending order. From the definition of the different variants of K-moments we
can think of estimators of the form:

$$\hat K'_{pq} = \frac{p-q+1}{n} \sum_{i=1}^{n} \big(\hat F(x_{(i:n)})\big)^{p-q}\, x^{q}_{(i:n)}$$

$$\hat K_{pq} = \frac{p-q+1}{n} \sum_{i=1}^{n} \big(\hat F(x_{(i:n)})\big)^{p-q}\, \big(x_{(i:n)} - \hat\mu\big)^{q} \tag{6.65}$$

$$\hat K^{+}_{pq} = \frac{p-q+1}{n} \sum_{i=1}^{n} \big(2\hat F(x_{(i:n)}) - 1\big)^{p-q}\, \big(x_{(i:n)} - \hat\mu\big)^{q}$$

where $x_{(i:n)}$ is the ith element of a sample of x of size n, sorted in ascending order; the
maximum of all $x_i$ is denoted as $x_{(n)} = x_{(n:n)}$. It is stressed that the ordering of the sample
is meant in terms of x and not $x^q$. More precisely, $x^{q}_{(i:n)} := (x_{(i:n)})^q$, which can be different
from $(x^q)_{(i:n)}$ except if q is an odd number or if x is strictly nonnegative.

While $F(x_{(i:n)})$ has a simple estimator, which depends on i and the sample size n (and
not on the specific form of F(x); see section 5.6), this estimator is not necessarily optimal
if $F(x_{(i:n)})$ (or $2F(x_{(i:n)}) - 1$) is raised to a power and is multiplied by $x_{(i:n)}$ (or $x_{(i:n)} - \hat\mu$)
raised to another power. In the subsections that follow we will examine better estimators.

Based on the properties of order statistics (section 4.12) and the above discourse, it is
reasonable to construct estimators in which $\hat F(x_{(i:n)})$ does not depend on $x_{(i:n)}$ but only on
i and n. In this case the estimator of $(p/n)\,(F(x_{(i:n)}))^{p-1}$ is no longer a stochastic variable
but a regular variable depending on i, n and p. If we denote it as $b_{inp}$, then an estimator of
the noncentral moment $K'_{pq}$ will be:

$$\hat K'_{pq} = \sum_{i=1}^{n} b_{i,n,p-q+1}\, x^{q}_{(i:n)} \tag{6.66}$$

which for q = 1 takes the form:

$$\hat K'_p = \sum_{i=1}^{n} b_{inp}\, x_{(i:n)} \tag{6.67}$$

We prove in Appendix 6-IV that for a random sample and q = 1 the estimator in (6.66) is
unbiased if we choose:

$$b_{inp} = \begin{cases} 0, & i < p \\[4pt] \dfrac{p}{n}\, \dfrac{\Gamma(n-p+1)}{\Gamma(n)}\, \dfrac{\Gamma(i)}{\Gamma(i-p+1)}, & i \ge p \ge 0 \end{cases} \tag{6.68}$$

where p can be any positive number ≤ n (usually, but not necessarily, integer). It can be
easily verified that:

$$\sum_{i=1}^{n} b_{inp} = 1 \tag{6.69}$$

which is a necessary condition for unbiasedness. Furthermore, for p = 1, $b_{in1} = 1/n$, while
for p = 2, the quantity $(n/2)\,b_{in2}$ can be regarded as the estimate of $F(x_{(i:n)})$, i.e.:

$$\hat F(x_{(i:n)}) = \frac{i-1}{n-1} \tag{6.70}$$

This has the interesting property of symmetry:

$$\hat F(x_{(n+1-i:n)}) = 1 - \hat F(x_{(i:n)}), \qquad 2\hat F(x_{(n+1-i:n)}) - 1 = -\big(2\hat F(x_{(i:n)}) - 1\big) \tag{6.71}$$

Other special cases of K-moment estimator coefficients $b_{inp}$ are shown in Table 6.5. The
fact that $b_{inp} = 0$ for i < p suggests that, as the moment order increases, progressively
fewer data values determine the moment estimate, until only one remains, the
maximum, when p = n, with $b_{nnn} = 1$. Furthermore, if p > n then $b_{inp} = 0$ for all i, 1 ≤
i ≤ n, and therefore estimation becomes impossible.


Table 6.5 Special cases of K-moment estimator coefficients.

| Case | $b_{inp}$ |
|---|---|
| p = 0 | $b_{in0} = 0$ |
| p = 1 | $b_{in1} = \dfrac{1}{n}$ |
| p = 2 | $b_{in2} = \dfrac{2}{n}\,\dfrac{i-1}{n-1}$ |
| p = 3 | $b_{in3} = \dfrac{3}{n}\,\dfrac{i-1}{n-1}\,\dfrac{i-2}{n-2}$ |
| p = 4 | $b_{in4} = \dfrac{4}{n}\,\dfrac{i-1}{n-1}\,\dfrac{i-2}{n-2}\,\dfrac{i-3}{n-3}$ |
| p = n − 1 | $b_{n-1,n,n-1} = \dfrac{1}{n}$, $\; b_{n,n,n-1} = 1 - \dfrac{1}{n}$ |
| p = n | $b_{nnn} = 1$ |
| i = n | $b_{nnp} = \dfrac{p}{n}$ |
| i = p | $b_{pnp} = p\,\mathrm{B}(p, n-p+1)$; symmetry: $b_{pnp} = b_{n-p,n,n-p}$; minimum at p = n/2 |

An illustration of the performance of the unbiased estimator of equations (6.66)-(6.68)
is given in Figure 6.4 for two distributions, lognormal and Pareto, where the estimates are
indistinguishable from the true values, even for the highest possible order, p = n (= 2000
in our example).

Figure 6.4 Illustration of the performance of the K-moment estimator of equations (6.66)-(6.68)
for the lognormal distribution (left column; LN(0,1)) and the Pareto distribution (right column;
tail index ξ = 0.15, scale parameter equal to 1, lower bound equal to 0) for q = 1 (middle row) and q = 2
(bottom row). For comparison the performance of the estimators of classical moments is also
shown (upper row; notice that in the Pareto distribution the true moments are ∞ for p > 1/0.15
= 6.67). The true moments were determined by numerical integration for the lognormal
distribution and from the equations in Table 6.2 for the Pareto distribution. The estimates are
averages of 100 simulations each with n = 2000 and are indistinguishable from the true
(theoretical) values. The 95% prediction limits (PL) are also shown.

Coming now to the tail K-moments, to get an unbiased estimator thereof, it suffices to
reverse the order of the sample, i.e., to replace $x_{(i:n)}$ with $x_{(n-i+1:n)}$ in (6.67), which
becomes:

$$\hat{\overline{K}}{}'_p = \sum_{i=1}^{n} b_{inp}\, x_{(n-i+1:n)} \tag{6.72}$$

where $b_{inp}$ is again given by (6.68).
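The unbiased estimator for q = 1 can be sketched in a few lines. The code below assumes the coefficient form $b_{inp} = (p/n)\,\Gamma(n-p+1)\Gamma(i)/(\Gamma(n)\Gamma(i-p+1))$ for i ≥ p (zero otherwise) and evaluates it through log-gamma functions to avoid overflow at high orders; the function names `b_coeff` and `k_moment` are hypothetical and chosen only for this illustration.

```python
import math


def b_coeff(i, n, p):
    """Coefficient b_inp: zero for i < p, otherwise a ratio of gamma functions
    evaluated in log space for numerical stability."""
    if i < p:
        return 0.0
    log_b = (math.log(p) - math.log(n)
             + math.lgamma(n - p + 1) - math.lgamma(n)
             + math.lgamma(i) - math.lgamma(i - p + 1))
    return math.exp(log_b)


def k_moment(sample, p, tail=False):
    """Noncentral K-moment estimate of order p (q = 1).
    With tail=True the sample order is reversed, giving the tail K-moment."""
    xs = sorted(sample, reverse=tail)
    n = len(xs)
    return sum(b_coeff(i, n, p) * x for i, x in enumerate(xs, start=1))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(k_moment(data, 1))             # sample mean
    print(k_moment(data, 2))             # average of the maxima over all pairs
    print(k_moment(data, 2, tail=True))  # average of the minima over all pairs
    print(k_moment(data, 4))             # p = n: only the sample maximum contributes
```

For integer p the estimate coincides with the average of the maxima (or, for the tail variant, the minima) over all subsets of size p, which is a convenient way to test an implementation on a tiny sample.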
6.10 Simplified estimators for independent samples 


As can be readily verified from (6.66)-(6.68), the unbiased estimator of $(F(x))^{p-1}$ is not
precisely the (p − 1) power of $\hat F(x_{(i:n)})$. However, we can find simplified and approximately
unbiased estimators with this property. We generalize the estimator $\hat F(x_{(i:n)})$ of (6.70) in
the following form:

$$\hat F(x_{(i:n)}) = \frac{i-a}{n-b} \tag{6.73}$$

where a and b are constants or, more generally, functions of n. Then we form the estimator
of $K'_p$ in one of the following two forms,

$$\hat K'_p = \frac{p}{n} \sum_{i=1}^{n} \left(\frac{i-a}{n-b}\right)^{p-1} x_{(i:n)}, \qquad \hat K'_p = \frac{p}{n} \sum_{i=p}^{n} \left(\frac{i-a}{n-b}\right)^{p-1} x_{(i:n)} \tag{6.74}$$

where the difference is in the lower limit of the sum; the second form assumes that the
weight of $x_{(i:n)}$ for i < p is zero, as in the unbiased estimator (6.68).

It can be easily seen that the condition

$$a = b \tag{6.75}$$

ensures that $\hat F(x_{(i:n)})$ will take values from 0 to 1, irrespective of the values $x_{(i)}$.
Furthermore, the condition

$$2a - b - 1 = 0 \tag{6.76}$$

gives the estimator the symmetric property (6.71).
It is well known (see section 5.6) that the values

$$a = 0, \qquad b = -1 \tag{6.77}$$

make a precisely unbiased estimator of $F(x_{(i:n)})$. However, here our aim is to find
unbiased estimators of K-moments. As shown above, the values

$$a = b = 1 \tag{6.78}$$

are consistent with both the above requirements (6.69) and (6.71), and indeed have been
used by Koutsoyiannis (2019a) for any order p, noting though that they result in some
bias. Another common choice, proposed by Hosking et al. (1985a, b) for a similar case (see
also Stedinger et al., 1993), is:

$$a = 0.35, \qquad b = 0 \tag{6.79}$$

which notably does not have the symmetric property (6.71).

Here, we suggest for the first form of the estimator (6.74) the parameters:

$$a = \frac{1 + n - \sqrt{n^2 - 1}}{2}, \qquad b = n - \sqrt{n^2 - 1} \tag{6.80}$$

with which the estimate in (6.73) becomes

$$\hat F(x_{(i:n)}) = \frac{2i - 1 - n + \sqrt{n^2 - 1}}{2\sqrt{n^2 - 1}} \tag{6.81}$$

By construction, this has the property of symmetry of equation (6.71) and satisfies
precisely the necessary condition for unbiasedness of equation (6.69) for p < 4. For larger
p, the latter condition is slightly violated, but this can be fixed as described in Appendix
6-V.

For the second form of the estimator (6.74), after a systematic numerical investigation,
the following parameter values have been found to be optimal:

$$a = \frac{1}{2} + \frac{1}{4(n-4)}, \qquad b = \frac{1}{2(n-4)} \tag{6.82}$$

With these parameters, the estimate in (6.73) becomes:

$$\hat F(x_{(i:n)}) = \frac{4i(n-4) - 2n + 7}{4n(n-4) - 2} \tag{6.83}$$

Again, this has the property of symmetry of equation (6.71) but it does not satisfy
precisely the necessary condition for unbiasedness of equation (6.69) for p > 1. However,
the error is very small, as detailed in Appendix 6-V.

The above two estimators are simpler in their application than the precisely unbiased
estimator (6.66)-(6.68) and could be used as alternatives for relatively low values of the
order p. Deviations are negligible for p < n/10.
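A minimal sketch of the simplified estimators follows, assuming the plotting-position form $\hat F(x_{(i:n)}) = (i-a)/(n-b)$ and weights $(p/n)\,\hat F^{\,p-1}$. It checks numerically that the parameters $a = (1 + n - \sqrt{n^2-1})/2$, $b = n - \sqrt{n^2-1}$ give the symmetric property and weights summing to one for p = 2 and 3. All function names are hypothetical.

```python
import math


def params_symmetric(n):
    """Parameters a, b of the first simplified form, as functions of n."""
    s = math.sqrt(n * n - 1.0)
    return (1.0 + n - s) / 2.0, n - s


def simplified_weights(n, p, a, b, start_at_p=False):
    """Weights (p/n) * ((i - a)/(n - b))**(p - 1) for i = 1..n.
    With start_at_p=True the weights for i < p are set to zero (second form)."""
    first = math.ceil(p) if start_at_p else 1
    return [0.0 if i < first else (p / n) * ((i - a) / (n - b)) ** (p - 1)
            for i in range(1, n + 1)]


def simplified_k_moment(sample, p, a, b, start_at_p=False):
    """Approximately unbiased noncentral K-moment estimate (q = 1)."""
    xs = sorted(sample)
    w = simplified_weights(len(xs), p, a, b, start_at_p)
    return sum(wi * xi for wi, xi in zip(w, xs))


if __name__ == "__main__":
    n = 10
    a, b = params_symmetric(n)
    print(2 * a - b - 1)                             # symmetry condition, ~0
    for p in (2, 3):
        print(p, sum(simplified_weights(n, p, a, b)))  # ~1 in both cases
    # With a = b = 1 and p = 2 the weights coincide with the unbiased ones
    print(simplified_k_moment([1.0, 2.0, 3.0, 4.0], 2, 1.0, 1.0))
```

The other parameter choices discussed in the text (a = 0, b = −1 or a = 0.35, b = 0) can be passed to the same functions for comparison.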

Another, even simpler estimation option, a quick-and-dirty estimator using just one 
data point per K-moment, will be discussed in Digression 6.D. 


6.11 Approximately unbiased estimators of central K-moments 


For q = 1 the estimator of the central K-moments, based on $b_{inp}$, i.e.:

$$\hat K_p = \sum_{i=p}^{n} b_{inp}\, \big(x_{(i:n)} - \hat\mu\big) \tag{6.84}$$

is again unbiased. Here we try to find an approximately unbiased estimator for q = 2. We
start from the estimator that is given by (6.66) but replacing $x_{(i:n)}$ with $x_{(i:n)} - \hat\mu$:

$$\hat K_{p2} = \sum_{i=p-1}^{n} b_{i,n,p-1}\, \big(x_{(i:n)} - \hat\mu\big)^2 \tag{6.85}$$

From classical statistics of the second-order moments (p = q = 2) it is known that:

$$\frac{\mathrm{E}[\hat K_{22}] - K_{22}}{K_{22}} = -\frac{1}{n} \tag{6.86}$$

It is not easy to derive theoretically a similar relationship for any p ≥ 2 = q. However,
a systematic simulation study shows that a good approximation is provided by
generalizing the latter equation, i.e.:

$$\frac{\mathrm{E}[\hat K_{p2}] - K_{p2}}{K_{p2}} \approx -\frac{1}{n} \tag{6.87}$$

This results in:

$$\mathrm{E}[\hat K_{p2}] \approx \frac{n-1}{n}\, K_{p2} \tag{6.88}$$

and thus, an approximately unbiased estimator is obtained by multiplying $\hat K_{p2}$ by n/(n −
1). Obviously, the correction becomes negligible if n is large. However, for small n, the
bias should be taken into account. Simulation results for sample size as small as n = 10 are
shown in Figure 6.5, which indicate the good performance of approximation (6.88).
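The bias correction for q = 2 can be illustrated as follows. The sketch assumes the estimator $\hat K_{p2} = \sum_i b_{i,n,p-1}(x_{(i:n)} - \hat\mu)^2$ with the gamma-function coefficients $b_{inp}$ of the unbiased q = 1 estimator, multiplied by n/(n − 1). For p = 2 this reduces to the usual unbiased sample variance, which gives a simple check. The function names are hypothetical.

```python
import math
import statistics


def b_coeff(i, n, p):
    """Coefficient b_inp of the unbiased q = 1 estimator (zero for i < p)."""
    if i < p:
        return 0.0
    return math.exp(math.log(p) - math.log(n)
                    + math.lgamma(n - p + 1) - math.lgamma(n)
                    + math.lgamma(i) - math.lgamma(i - p + 1))


def central_k2_moment(sample, p, correct_bias=True):
    """Central K-moment estimate of orders (p, 2), p >= 2.
    If correct_bias is True, the raw estimate is multiplied by n/(n - 1)."""
    xs = sorted(sample)
    n = len(xs)
    mu = sum(xs) / n
    raw = sum(b_coeff(i, n, p - 1) * (x - mu) ** 2
              for i, x in enumerate(xs, start=1))
    return raw * n / (n - 1) if correct_bias else raw


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(central_k2_moment(data, 2), statistics.variance(data))  # identical
    print(central_k2_moment(data, 3, correct_bias=False))
    print(central_k2_moment(data, 3))
```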





[Figure 6.5: plot of K-moment values (logarithmic vertical axis, 1 to 100) against moment order, p
(0 to 10). Legend: K'_p true and estimate; K'_{p2} true and estimate; K_{p2} true, K_{p2} plus bias, and
K_{p2} estimate.]

Figure 6.5 Simulation results of estimated noncentral and central moments of a Pareto
distribution from a sample with size n = 10 for order q = 1 and 2. The Pareto distribution has tail
index ξ = 0.25, scale parameter equal to 1, and lower bound zero. The true moments are calculated from
the equations in Table 6.2. The estimated moments are average sample estimates from 10 000
simulations. The curve “$K_{p2}$ plus bias” corresponds to the right-hand side of equation (6.88).


6.12 Effect of autocorrelation on estimation 


A K-moment is a characteristic of the marginal, first order, distribution of the process and
therefore it is not affected by the dependence structure. However, the estimator is: time
dependence induces bias to estimators of K-moments. Thus, the unbiasedness or
approximate unbiasedness claimed in the previous sections ceases to hold in stochastic
processes.

We quantify the effect of dependence by adapting the moment order for which the
estimation is made. Specifically, the estimator of the noncentral K-moment for q = 1, i.e.:

$$\hat K'_p = \sum_{i=1}^{n} b_{inp}\, x_{(i:n)} \tag{6.89}$$

does not actually correspond to the theoretical K-moment for the same p, but to that for a
smaller p' < p, i.e.:

$$\mathrm{E}[\hat K'_p] = K'_{p'} \tag{6.90}$$

Likewise for central K-moments and for q = 2. We wish to find p'.

Noting that in $\hat K'_1 = \hat\mu$ there is no bias, we begin studying $K'_2$. We recall from (6.3) that
$K'_2 = \mathrm{E}[x_{(2)}] = \mathrm{E}[\max(x_i, x_j)]$, where $(x_i, x_j)$ are independent. In case of dependence the
quantity:

$$K'^{\,\mathrm d}_2 := \mathrm{E}[\max(x_i, x_j)] \tag{6.91}$$

where the superscript ‘d’ stands for dependence, will be different from $K'_2$. We may
assume that in the case of a time series with dependence, rather than of a random sample,
what we actually estimate is $K'^{\,\mathrm d}_2$, rather than $K'_2$. In case of positive dependence, $K'^{\,\mathrm d}_2 < K'_2$.

In Appendix 6-VI we determine that for a process with standard normal distribution 
the following adjustment is required: 

n-1 


Ki -K, 1 
O%(n;r(@)) = AE Cue r(t)(n—1) (6.92) 


where r(T) is the process autocorrelation for lag t. For a Markov process, in which r(t) = 


re 


2r 
(1-r)(n- 1) 


where the superscript ‘M’ stands for Markov. Equation (6.93) clearly shows that, unless n 


OMG ee (6.93) 


is very low and r very high (e.g. > 0.90), $\Theta^{\mathrm M}(n) \approx 0$ and thus we can neglect the effect of
autocorrelation. However, for an HK process, where $r(\tau) \approx H(2H-1)\,\tau^{2H-2}$, as shown in
Appendix 6-VI:

$$\Theta^{\mathrm{HK}}(n, H) \approx \frac{2H(1-H)}{n-1} - \frac{1}{2\,(n-1)^{2-2H}} \tag{6.94}$$

For H > 1/2, the last term of the right-hand side is typically non-negligible and therefore
the bias should be taken into account. A graphical depiction of the adjustment coefficient
$\Theta^{\mathrm{HK}}(n, H)$ is provided in Figure 6.6.

Now, we assume, based on simulation results, that the same adjustment applies
approximately to all orders p, i.e.,

$$\Theta^{\mathrm{HK}}(n, H) \approx \frac{K^{\mathrm d}_p - K_p}{K_p} \tag{6.95}$$

Under this assumption we find in Appendix 6-VI the following approximation of a
modified order p' for which:

$$p' \approx 2\Theta + (1 - 2\Theta)\, p^{(1+\Theta)^2}, \qquad K_{p'} = K^{\mathrm d}_p = (1 + \Theta)\, K_p \tag{6.96}$$

where for notational simplification we have shortened $\Theta^{\mathrm{HK}}(n, H)$ to Θ.

Figure 6.6 Adjustment coefficient Θ (= $\Theta^{\mathrm{HK}}(n, H)$) of the central K-moment for q = 1 for a
Hurst-Kolmogorov process.
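To get a feeling for the size of the adjustment, the sketch below evaluates the coefficient Θ and the adapted order p' under the assumed forms $\Theta = 2H(1-H)/(n-1) - 1/(2(n-1)^{2-2H})$ and $p' = 2\Theta + (1-2\Theta)\,p^{(1+\Theta)^2}$. The function names are hypothetical. For n = 2000 and H = 0.9 the nominal order p = 2000 maps to an adapted order of roughly 500, while for H = 1/2 no adaptation occurs.

```python
def theta_hk(n, hurst):
    """Adjustment coefficient for an HK process (assumed closed form)."""
    return (2.0 * hurst * (1.0 - hurst) / (n - 1)
            - 1.0 / (2.0 * (n - 1) ** (2.0 - 2.0 * hurst)))


def adapted_order(p, theta):
    """Adapted moment order p' for a nominal order p and coefficient theta."""
    return 2.0 * theta + (1.0 - 2.0 * theta) * p ** ((1.0 + theta) ** 2)


if __name__ == "__main__":
    n = 2000
    for hurst in (0.5, 0.7, 0.9):
        th = theta_hk(n, hurst)
        print(hurst, th, adapted_order(n, th))
```

Because Θ is negative for H > 1/2, the exponent $(1+\Theta)^2$ is below one and the adapted order grows more slowly than p, which is what shortens the range of return periods that a dependent record can support.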


The same approximation for p' will hold for the noncentral K-moments. Further, we
may use the same coefficients and results for q = 2. Specifically, for the noncentral $K'_{p2}$ the
framework could be kept unchanged. However, for the central $K_{p2}$, while we can keep p'
as derived above, we must have in mind that in the presence of HK behaviour $\hat K_{22}$ is a
biased estimator of $K_{22}$. Applying the results of section 4.6 (in particular, equation (4.24))
for the HK process we find that the bias is:

$$\frac{\mathrm{E}[\hat K_{22}] - K_{22}}{K_{22}} = -\frac{1}{n^{2-2H}} \tag{6.97}$$

This is similar to (6.86) except that n is raised to the power 2 − 2H (for H = 1/2 we recover
(6.86)); also, this bias is roughly 2Θ. Generalizing this for any p, as we did in section 6.11,
and expressing it for the adapted order p' we write:

$$\frac{\mathrm{E}[\hat K_{p2}] - K_{p'2}}{K_{p'2}} \approx -\frac{1}{n^{2-2H}} \tag{6.98}$$

Hence what we estimate in this case is the K-moment plus (negative) bias.

While the derivation of these relationships in Appendix 6-VI was based on the normal
distribution, the final equation (6.213) can be used as an approximation regardless of the
distribution. Figure 6.7 provides an empirical confirmation that the approximation is
satisfactory.

Figure 6.7 Illustration of the performance of the adaptation of K-moment estimation for an HK
process with Hurst parameter 0.9 and lognormal marginal distribution (LN(0,1); same as in Figure
6.4, left): (left column) noncentral moments; (right column) central moments; (middle row) q
= 1; (bottom row) q = 2. For comparison classical moments are also shown (upper row). The
estimates are averages of 200 simulations each with n = 2000 and are almost indistinguishable
from the theoretical values. For $K_{p2}$ (lower right) the true K-moments are represented by the
dashed curve while the curve marked “theoretical” corresponds to the true plus bias, where bias
is determined from equation (6.98). The 95% prediction limits (PL) are also shown.


The simulations depicted in Figure 6.7 are for the lognormal distribution, which differs
substantially from the normal, and a high Hurst parameter, H = 0.9. Interestingly, what
would be assigned as K-moment for p = 2000, without taking into account the effect of
dependence, actually corresponds to the true K-moment of p' ≈ 500; as we will see
in section 6.14, this has dramatic consequences in the assignment of return periods to
events. Another important observation from comparing Figure 6.7 with Figure 6.4 (left),
which are for the same marginal distribution, is the dramatic broadening of the prediction
intervals. This illustrates how dependence amplifies uncertainty.


Digression 6.B: Does periodicity affect estimation of K-moments? 


Hydroclimatic processes are influenced by regular seasonal or diurnal changes, or else 
periodicities. Often, in their stochastic treatment we use cyclostationary models, whose statistical 
properties are deterministic periodic functions of time. In other cases, we are interested in the 
overall behaviour of a process. For example, when studying rainfall extremes, often we are 
interested only in the magnitude of the extremes, rather than in the season in which an extreme 
has occurred. In such cases we can model the natural process as a stationary stochastic process, 
treating the periodicity indirectly through a periodic autocovariance function (Koutsoyiannis, 
2017). 

As an example, we assume a fully deterministic, strictly periodic process composed of a single
harmonic with period a, described by:

$$v(t) = \cos\big(2\pi (t + b)/a\big), \qquad 0 \le b < a$$

where b is the phase. This is a deterministic process but, when it is superimposed on a stochastic
process, the resulting process is also a stochastic process. Therefore, we need to know the
stochastic properties of v(t) as if it were a stochastic process. In particular, its autocorrelation
function is (Koutsoyiannis, 2017):

$$r_v(h) = \cos(2\pi h/a)$$

This does not vanish for large lags; in particular, for lags h that are multiples of the period a
it keeps a constant value of 1. Therefore, as autocorrelation is occasionally high, irrespective of lag,
one may suspect that this behaviour may influence estimation. However, this is not the case
because, in fact, the autocorrelation alternates between positive and negative values and the
average is zero.

We will illustrate this by a simulation experiment. To make it more interesting we superimpose
a harmonic component on a Markov process and we will thus show that neither the Markovian
behaviour nor the periodicity affects estimation. For the Markov process we use the simple AR(1)
model in discrete time τ := t/D:

$$u_\tau = r\, u_{\tau-1} + \sqrt{1 - r^2}\; w_\tau$$

where $w_\tau$ is standard Gaussian white noise. It is easily seen that $u_\tau$ is Gaussian:

$$f_u(u) = \frac{e^{-u^2/2}}{\sqrt{2\pi}}$$

with autocorrelation for discrete lag η := h/D:

$$r_u(\eta) = r^{\eta}$$

The marginal distribution and density of the harmonic function are (cf. Markonis and
Koutsoyiannis, 2013):

$$F_v(v) = 1 - \frac{\arccos(v)}{\pi}, \qquad f_v(v) = \frac{1}{\pi\sqrt{1 - v^2}}, \qquad -1 \le v \le 1$$

and its variance is var[v] = 1/2. Now we form the composite process

$$z := \sqrt{2/3}\,(u + v)$$

which has variance var[z] = (2/3)(1 + 1/2) = 1 and autocorrelation (identical to
autocovariance):

$$r_z(\eta) = (2/3)\big(r^{\eta} + (1/2)\cos(2\pi\eta D/a)\big)$$

The marginal density function of z, obtained from the convolution equation (2.102), is:

$$f_z(z) = \frac{\sqrt{3}}{2\pi^{3/2}} \int_{-1}^{1} \frac{\exp\!\big(-\tfrac{1}{2}\big(\sqrt{3/2}\; z - v\big)^2\big)}{\sqrt{1 - v^2}}\, \mathrm{d}v$$

This integral cannot be evaluated analytically, but as shown in Figure 6.8(a), its numerical
evaluation suggests that it is close to the Gaussian density with some noticeable deviations at
large values of the stochastic variable. If we make the transformation


y=2|z\° 


and determine its density using equation (2.11), then that density becomes almost
indistinguishable from that of the normal distribution, specifically that of u.
This is depicted in Figure 6.8(a). Therefore, if we make the final transformation

$$x := e^{y}$$

we will get a variable x with lognormal marginal distribution LN(0,1), as that shown in Figure 6.4
(left), but now with Markov dependence and periodicity, as described above by the
autocorrelation $r_z(\eta)$. As shown in Figure 6.8(b), the autocorrelations for y and x do not differ
substantially from that of z, $r_z(\eta)$. The latter are calculated by stochastic simulation to avoid other
types of numerical integration which are more tedious. A feeling about how the generated time
series look is obtained from the plot of Figure 6.8(c).

Figure 6.8 Illustration of theoretical and simulated results for the example of Digression 6.B, where the
final process x has lognormal marginal distribution LN(0,1), the lag-1 autocorrelation of the Markov
process is r = 0.5 and the period of the periodic component is a = 20: (a) probability densities of variables
u, v, z and y; (b) autocorrelation of processes z, x and y; (c) 100 terms of time series from processes x and
y; (d) classical moments; (e)-(f) K-moments for q = 1 and 2. Panels (d)-(f) are identical to those of Figure
6.4 (left), which is for the same marginal distribution. The estimates are averages of 100 simulations each
with n = 2000 and in the case of K-moments are indistinguishable from the true (theoretical) values.

Now we perform a simulation experiment for the K-moments as that presented in section 6.9
and depicted in Figure 6.4 (left), which is also for the lognormal distribution LN(0,1). To estimate
K-moments we use the unbiased estimator of equations (6.66)-(6.68) without any type of
correction at all. Panels (d)-(f) in the figure show the classical and K-moments, theoretical and
estimated. Comparing them with the respective plots of Figure 6.4 (left), we clearly see that they
are identical. This confirms our claim that periodicity and Markov dependence do not affect
K-moment estimates.
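The composite process of this digression is easy to simulate. The sketch below generates the sum of an AR(1) process and a harmonic, scaled by $\sqrt{2/3}$, and compares the sample autocorrelation with the assumed theoretical form $(2/3)\big(r^{\eta} + (1/2)\cos(2\pi\eta D/a)\big)$. The time step D = 1, the random phase, the seed and all function names are assumptions made for the illustration; the further transformations to y and x are not reproduced here.

```python
import math
import random


def simulate_z(n, r=0.5, a=20.0, seed=1):
    """Simulate z = sqrt(2/3) * (u + v): AR(1) process u plus harmonic v (D = 1)."""
    rng = random.Random(seed)
    phase = rng.uniform(0.0, a)       # phase b of the harmonic
    u = rng.gauss(0.0, 1.0)           # start from the stationary distribution
    scale = math.sqrt(2.0 / 3.0)
    z = []
    for t in range(n):
        u = r * u + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        v = math.cos(2.0 * math.pi * (t + phase) / a)
        z.append(scale * (u + v))
    return z


def sample_autocorr(x, lag):
    """Standard (biased) sample autocorrelation at the given lag."""
    n = len(x)
    m = sum(x) / n
    var = sum((xi - m) ** 2 for xi in x) / n
    cov = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag)) / n
    return cov / var


def theoretical_autocorr(lag, r=0.5, a=20.0):
    """Assumed autocorrelation of z for discrete lag and D = 1."""
    return (2.0 / 3.0) * (r ** lag + 0.5 * math.cos(2.0 * math.pi * lag / a))


if __name__ == "__main__":
    z = simulate_z(20_000)
    for lag in (1, 10, 20, 40):
        print(lag, sample_autocorr(z, lag), theoretical_autocorr(lag))
```

The autocorrelation keeps oscillating between about −1/3 and +1/3 at large lags, yet its average over a period is zero, which is the reason given in the text for the absence of an effect on estimation.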


6.13 Estimation by merging information from dependent records 


Assuming that we have several observation records, representing the same stochastic
process, we can use them simultaneously to improve our estimations. If the records can
be regarded as random samples that are independent of each other, we can merge them
in a single sample and use the merged sample for estimation. Assuming that there are m
samples, each with length $n_1$, we will then form a merged sample of size n = m$n_1$ and we
can apply the usual estimation procedures for that sample, while estimation uncertainty
will depend only on the size n, and obviously will be smaller than that from a sample
with size $n_1$.

But what if the different records represent correlated processes? This is a frequent case
we meet in practice; for example, if we study rainfall observed at several adjacent stations.
Again, merging the records improves the estimation, reducing uncertainty, but not as
much as if there were a single random sample of size n.

We denote $x_{ij}$, i = 1, …, m, j = 1, …, $n_1$, the stochastic variable representing the jth item
of sample i and $\hat\mu_i = (x_{i1} + \dots + x_{i n_1})/n_1$ the average of the ith sample. We assume that
the samples are cross-correlated with the same correlation r ≥ 0, i.e., $\mathrm{corr}[x_{ij}, x_{i'j}] = r$, i ≠ i',
while $\mathrm{corr}[x_{ij}, x_{i'j'}] = 0$, j ≠ j'. It is easy to show that, if γ := var[$x_{ij}$], then the variance of
$\hat\mu_i$ is γ/$n_1$ and that of the overall average is:

$$\gamma^{(n)} := \mathrm{var}\!\left[\frac{\hat\mu_1 + \dots + \hat\mu_m}{m}\right] = \frac{\gamma}{n}\,\big(1 + r(m-1)\big) \tag{6.99}$$

In a sample of size n the variance would be γ/n. If we introduce an equivalent sample
size n' so that the variance in (6.99) be γ/n', then it is readily found that

$$n' = \frac{n}{1 + r(m-1)} \tag{6.100}$$

It can be noted that as m increases (tends to infinity) the equivalent sample size tends to
an upper limit, which is $n'_{\max} = n_1/r$.

In the above analysis, whose results have been well known for many decades (Yule,
1945; Matalas and Langbein, 1962; Castellarin et al., 2005), the determination of the
equivalent sample size has been based on the variance of the sample average. We can
repeat the analysis for higher moments, although this is more complicated. Stedinger
(1983) has shown (for normal distribution) that if we use the variance of the sample
variance as a basis, then an equivalent sample size n'' is derived by a relationship
analogous to (6.100):

$$n'' = \frac{n}{1 + r^2(m-1)} \tag{6.101}$$

Obviously, n'' > n' and the upper limit as m → ∞ becomes $n''_{\max} = n_1/r^2 > n'_{\max}$. This
suggests that as the moment order increases, the information gain from merging records
increases too. Mimiyianni (2010) has confirmed this property in a systematic simulation
study. This is very encouraging if we study extremes, as extremes are related to
high-order moments.
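As a small worked illustration, the two equivalent sample sizes can be computed directly, assuming the forms n' = n/(1 + r(m − 1)) and n'' = n/(1 + r²(m − 1)) with n = m·n₁. The function names are hypothetical; the numbers m = 100, n₁ = 20 and r = 0.25 are taken as an example only.

```python
def equivalent_size_mean(m, n1, r):
    """Equivalent sample size based on the variance of the sample average."""
    return m * n1 / (1.0 + r * (m - 1))


def equivalent_size_variance(m, n1, r):
    """Equivalent sample size based on the variance of the sample variance."""
    return m * n1 / (1.0 + r * r * (m - 1))


if __name__ == "__main__":
    m, n1, r = 100, 20, 0.25
    print(equivalent_size_mean(m, n1, r), n1 / r)           # value and its upper limit
    print(equivalent_size_variance(m, n1, r), n1 / r ** 2)  # value and its upper limit
```

With these inputs the 2000 merged values are worth fewer than 80 independent ones for estimating the mean but close to 280 for estimating the variance, which illustrates why merging pays off more for higher-order statistics.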

The K-moments estimation is thus expected to improve substantially by merging
records, particularly for high orders. The analysis we have made in section 6.12 could also
be used in this case as by merging cross-correlated records we induce autocorrelation to
the merged sample. However, the problem is more complicated now, as it depends on
three parameters, r, m and $n_1$. What is required is again to estimate an adapted moment
order p' for each estimated $\hat K'_p$. Stochastic simulation is obviously a generic method that
can easily handle this problem. However, a quick technique for practical use is this.

• For p ≤ $n_1$ we set p' = p.
• For p > $n_1$ we use the HK framework of section 6.12 for H corresponding to r, i.e.:

$$H = \frac{1}{2} + \frac{\ln(1 + r)}{2\ln 2} \tag{6.102}$$

  Specifically, we apply (6.94) to find Θ(n, H), and then modify (6.96) to estimate
  p' as:

$$p' = 2\Theta + (1 - 2\Theta)\,(p - n_1 + 1)^{(1+\Theta)^2} + n_1 - 1 \tag{6.103}$$

It can be readily verified that for p = $n_1$, p' = p.
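The quick technique can be sketched as below. The code assumes H = 1/2 + ln(1 + r)/(2 ln 2), the closed form $\Theta = 2H(1-H)/(n-1) - 1/(2(n-1)^{2-2H})$ evaluated at the merged size n = m·n₁, and $p' = 2\Theta + (1-2\Theta)(p - n_1 + 1)^{(1+\Theta)^2} + n_1 - 1$ for p > n₁. Function names are hypothetical.

```python
import math


def hurst_from_correlation(r):
    """Hurst parameter whose lag-1 autocorrelation equals r."""
    return 0.5 + math.log(1.0 + r) / (2.0 * math.log(2.0))


def theta_hk(n, hurst):
    """Adjustment coefficient for an HK process (assumed closed form)."""
    return (2.0 * hurst * (1.0 - hurst) / (n - 1)
            - 1.0 / (2.0 * (n - 1) ** (2.0 - 2.0 * hurst)))


def adapted_order_merged(p, m, n1, r):
    """Adapted order p' for a merged sample of m cross-correlated records of size n1."""
    if p <= n1:
        return float(p)
    theta = theta_hk(m * n1, hurst_from_correlation(r))
    return (2.0 * theta
            + (1.0 - 2.0 * theta) * (p - n1 + 1) ** ((1.0 + theta) ** 2)
            + n1 - 1)


if __name__ == "__main__":
    for r in (0.25, 0.64):
        print(r, [round(adapted_order_merged(p, 100, 20, r), 1)
                  for p in (10, 20, 100, 2000)])
```

For the low correlation the adaptation is close to negligible, while for r = 0.64 the nominal order 2000 is reduced to roughly half.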

Figure 6.9 presents two examples with application of the above technique for m = 100
subsamples of size $n_1$ = 20 each and for a low and a high value of correlation coefficient.
The simulated subsamples were constructed by the equations

$$x_{ij} = \exp y_{ij}, \qquad y_{ij} = a\, v_{0j} + \sqrt{1 - a^2}\; v_{ij}$$

where $v_{ij}$, i = 0, …, m, j = 1, …, $n_1$, are independent stochastic variables with
distribution N(0,1). The parameter a was chosen 0.5 and 0.8 for the low and high
correlation, respectively, resulting in correlation coefficients for y, r = 0.25 and r = 0.64,
respectively, which were used in calculations. (The correlations for x are r = 0.16 and r =
0.52, respectively.)

Figure 6.9 Illustration of the performance of the adaptation of K-moment estimation for a merged
sample composed of m = 100 correlated sub-samples of size $n_1$ = 20 each. Each sub-sample is
random with lognormal marginal distribution (LN(0,1); same as in Figure 6.4, left). The
correlations of the logarithms of the variables (which have normal distribution) are r = 0.25 (left
column) and r = 0.64 (right column). Noncentral K-moments are shown for q = 1 (middle row)
and q = 2 (bottom row). Classical noncentral moments are also shown (upper row). For
comparison, estimates from one sample with size $n_1$ = 20 are also shown. All estimates are
averages of 100 simulations, while 95% prediction limits (PL) are also shown. In the left column
(r = 0.25) the adaptation turns out to be negligible and the adapted curves are indistinguishable
from the non-adapted.


The figure shows clearly that merging of samples enables estimation of K-moments of
much higher order, of about 1000 or more, while a single sample allows estimation up to
order 20. The importance of estimating high order moments will become clear in section
6.14, where it will be seen that estimates of high-order K-moments are equivalent to
quantile estimates of high return periods. In addition, the estimation uncertainty,
quantified by the prediction intervals, is substantially reduced in the case of the merged
sample, in comparison to that of an individual sub-sample. However, comparison with
Figure 6.4 (left), which is for the same marginal distribution, shows that the uncertainty
is still higher than in a purely random sample of equal length.

A final question to discuss is how to deal with situations where the subsamples are not
IID but time series from a process with HK behaviour. Again, stochastic simulation
provides the means for proper adaptation of the estimates. A quick-and-dirty solution is
to use the HK framework of section 6.12 for the entire range of moment orders and for a
“bulk” Hurst parameter. In this case the variance of $\hat\mu_i$ is $\gamma/n_1^{2-2H}$ and equation (6.99),
which gives the variance of the overall average, is modified to

$$\gamma^{(n)} = \frac{\gamma}{n_1^{2-2H}}\, \frac{1}{m}\,\big(1 + r(m-1)\big) \tag{6.104}$$

The bulk Hurst parameter could be specified according to the following equation, which
fits an HK climacogram to the time scales 1 and n:

$$H_{\mathrm b} = 1 + \frac{\ln(\gamma^{(n)}/\gamma)}{2\ln n} \tag{6.105}$$

This results in:

$$H_{\mathrm b} = 1 - (1 - H)\,\frac{\ln n_1}{\ln n} + \frac{\ln\big(1 + r(m-1)\big) - \ln m}{2\ln n} \tag{6.106}$$

Illustration of this technique is provided in Figure 6.10 for 100 time series of size 20
each. Each of them was constructed with the same method as in the examples of Figure
6.9 except that the variables $v_{ij}$ for each i follow the HK process with H = 0.8, while they
are independent for different i. The parameter a was chosen 0.8, resulting in correlation
coefficients for y, r = 0.64. The resulting $H_{\mathrm b}$ is 0.892 and the overall performance of the
method, compared to accurate simulation results, is satisfactory (even though with
slightly higher H = 0.91, not shown in the figure, the results would perfectly correspond
to the simulation results). Figure 6.10 shows that even in this case merging of time series
enables estimation of K-moments of much higher order, of p' ≈ 650, while a single time
series allows estimation only up to order p' = 16. However, the estimation uncertainty,
quantified by the prediction intervals, is not reduced substantially in the case of the
merged time series.
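The bulk Hurst parameter quoted above can be reproduced with a few lines, assuming the closed form $H_{\mathrm b} = 1 - (1-H)\ln n_1/\ln n + \big(\ln(1 + r(m-1)) - \ln m\big)/(2\ln n)$ with n = m·n₁. The function name is hypothetical.

```python
import math


def bulk_hurst(hurst, r, m, n1):
    """Bulk Hurst parameter of a merged series made of m cross-correlated
    HK time series of length n1 (assumed closed form)."""
    n = m * n1
    return (1.0
            - (1.0 - hurst) * math.log(n1) / math.log(n)
            + (math.log(1.0 + r * (m - 1)) - math.log(m)) / (2.0 * math.log(n)))


if __name__ == "__main__":
    # Setting of the illustration: H = 0.8, r = 0.64, m = 100, n1 = 20
    print(round(bulk_hurst(0.8, 0.64, 100, 20), 3))
```

For a single series (m = 1) the expression returns H itself, which is a quick consistency check.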


6.14 Return periods of K-moment values and Λ-coefficients 


As we have seen in section 5.6, order statistics have an important advantage over other
statistics, as to each of them we can assign a value of the distribution function and hence
of return period. This turns out to be the case also with K-moments as they are closely
related to order statistics. Intuitively we can expect that the noncentral K-moment of
orders (p,1), i.e. the value x = $K'_p$, will correspond to a
return period of about 2pD (where D is the time step or, more generally, a time reference
for the specification of return period). This is precise for a symmetric distribution and for
p = 1, as $K'_1$ is the mean value which has return period 2D and, as we will see below, it
cannot be much lower than 2pD for any p and for any distribution.

Figure 6.10 Illustration of the performance of the adaptation of K-moment estimation for a
merged time series composed of m = 100 correlated time series of size $n_1$ = 20 each, generated
from the exponentiated HK process for H = 0.8, having lognormal marginal distribution (LN(0,1)).
The correlation of the logarithms of the variables of the different time series (which have normal
distribution) is r = 0.64. Noncentral K-moments are shown for q = 1 (left) and q = 2 (right). For
comparison estimates from one time series with size $n_1$ = 20 are also shown. All estimates are
averages of 100 simulations, while 95% prediction limits (PL) are also shown.


Generally, we can express the return period by the relationship: 


$$\frac{T(K'_p)}{D} = p\,\Lambda_p \qquad (6.107)$$


where Λ_p is a coefficient, already introduced in section 5.6, which generally depends on
the distribution function and the order p. As we will see, the range of variation of Λ_p is
not wide. As a first rough approximation, the rule of thumb:


$$\frac{T(K'_p)}{D} \approx 2p \qquad (6.108)$$


helps intuition. However, the precise definition of Λ_p is:

$$\Lambda_p := \frac{1}{p\,\left(1 - F(K'_p)\right)} = \frac{T(K'_p)}{p\,D} \qquad (6.109)$$


For given p and distribution function F(x), K'_p is analytically or numerically determined
from its definition. Then T(K'_p) and Λ_p are determined from their definitions.
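
As an illustration of equations (6.107) and (6.109) for a case with an analytical solution, the following Python sketch evaluates the coefficient for the exponential distribution. It assumes unit scale and relies on the fact that the noncentral K-moment of order (p,1) is the expected maximum of p independent values, which for the exponential distribution equals the harmonic number H_p. The function names are illustrative only.

```python
import math


def harmonic(p: int) -> float:
    """p-th harmonic number H_p = 1 + 1/2 + ... + 1/p."""
    return sum(1.0 / k for k in range(1, p + 1))


def coefficient_exponential(p: int) -> float:
    """Coefficient of eq. (6.109) for a unit-scale exponential variable.

    Here K'_p = H_p, so the exceedance probability is 1 - F(K'_p) = exp(-H_p)
    and the coefficient equals exp(H_p) / p.
    """
    exceedance = math.exp(-harmonic(p))
    return 1.0 / (p * exceedance)


def return_period_ratio(p: int) -> float:
    """T(K'_p)/D = p times the coefficient, eq. (6.107)."""
    return p * coefficient_exponential(p)


# The coefficient starts at e for p = 1 and decays slowly towards exp(gamma).
for p in (1, 2, 10, 100, 1000):
    print(p, round(coefficient_exponential(p), 4), round(return_period_ratio(p), 2))
```

The printed values stay between about 1.78 and 2.72 over three orders of magnitude of p, which is the narrow range of variation referred to in the text.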

In the absence of an analytical solution, we can establish an exact relationship between p
and T by performing numerical calculations for several p. The small variation of Λ_p with
p makes possible a very good approximation if we first accurately determine the specific
values Λ_1 and Λ_∞. The value Λ_1 is very easy to determine, as it refers to the return period
of the mean:


$$\Lambda_1 = \frac{1}{1 - F(\mu)} = \frac{T(\mu)}{D} \qquad (6.110)$$


Furthermore, in a number of customary distributions, specifically those belonging to the
domain of attraction of the Extreme Value Type I distribution, Λ_∞ has a constant value,
independent of the distribution. As shown in Appendix 6-VII (see also section 5.6) this is:


$$\Lambda_\infty = e^{\gamma} = 1.781 \qquad (6.111)$$


where γ is the Euler constant.

For the approximation of Λ_p we can use the following simple relationship, which is
satisfactory for several distributions:

$$\Lambda_p \approx \Lambda_\infty + \frac{\Lambda_1 - \Lambda_\infty}{p} \qquad (6.112)$$

This yields a linear relationship between the return period T and p:

$$\frac{T(K'_p)}{D} = p\,\Lambda_p \approx \Lambda_\infty\, p + (\Lambda_1 - \Lambda_\infty) \qquad (6.113)$$


However, in some distributions like the lognormal and Weibull, the decay of Λ_p with
increasing p is very slow. Furthermore, in some cases Λ_p is not always a decreasing
function of p, as implied by equation (6.112). To account for such cases, we may enrich
(6.112) by adding two more parameters, β and B, according to the expression:


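$$\Lambda_p \approx \Lambda_\infty + \frac{A}{p} - B \ln\left(\beta'\left(1 + \frac{\beta}{(p+1)^{\beta} - 1}\right)\right), \qquad A = \Lambda_1 - \Lambda_\infty + B \ln\left(\beta'\left(1 + \frac{\beta}{2^{\beta} - 1}\right)\right), \qquad \beta' := \frac{1}{\max(1,\, 1 - \beta)} \qquad (6.114)$$

where the expression for A ensures exact recovery of Λ_1. Most interesting are the cases:

(a) β = 0, in which:

$$\Lambda_p \approx \Lambda_\infty + \frac{A}{p} - B \ln\left(1 + \frac{1}{\ln(p+1)}\right), \qquad A = \Lambda_1 - \Lambda_\infty + B \ln\left(1 + \frac{1}{\ln 2}\right) \qquad (6.115)$$

(b) β = 1, in which:

$$\Lambda_p \approx \Lambda_\infty + \frac{A}{p} - B \ln\left(1 + \frac{1}{p}\right), \qquad A = \Lambda_1 - \Lambda_\infty + B \ln 2 \qquad (6.116)$$

(c) β = −1, in which:

$$\Lambda_p \approx \Lambda_\infty + \frac{A}{p} - B \ln\left(1 + \frac{1}{2p}\right), \qquad A = \Lambda_1 - \Lambda_\infty + B \ln(3/2) \qquad (6.117)$$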
The rationale and details of this approximation are given in Appendix 6-VIII. The resulting
return period is

$$\frac{T(K'_p)}{D} = p\,\Lambda_p \approx p\,\Lambda_\infty + A - B\,p \ln\left(\beta'\left(1 + \frac{\beta}{(p+1)^{\beta} - 1}\right)\right) \qquad (6.118)$$

We refer to the approximations (6.108), (6.113) and (6.118) as the rule-of-thumb,
linear and nonlinear approximations, respectively.
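
The three approximations can be compared numerically in a case where the exact coefficient is known. The sketch below does this for the exponential distribution, for which the exact coefficient is exp(H_p)/p. It assumes the nonlinear form with β = 1, i.e. Λ_p ≈ Λ_∞ + A/p − B ln(1 + 1/p) with A = Λ_1 − Λ_∞ + B ln 2, and takes B = (2Λ_1 − 3Λ_∞)/(2(1 − ln 2)) ≈ 0.152 as listed for the exponential distribution in Table 6.6. Variable and function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329
LAMBDA_INF = math.exp(EULER_GAMMA)  # eq. (6.111), about 1.781
LAMBDA_1 = math.e                   # exponential: 1 / (1 - F(mean)) = e

# Parameters of the nonlinear approximation with beta = 1 (assumed form, see lead-in).
B = (2.0 * LAMBDA_1 - 3.0 * LAMBDA_INF) / (2.0 * (1.0 - math.log(2.0)))
A = LAMBDA_1 - LAMBDA_INF + B * math.log(2.0)


def exact(p: int) -> float:
    """Exact coefficient for the exponential distribution, exp(H_p) / p."""
    h = sum(1.0 / k for k in range(1, p + 1))
    return math.exp(h) / p


def rule_of_thumb(p: int) -> float:
    """Eq. (6.108): T/D = 2p, i.e. a constant coefficient of 2."""
    return 2.0


def linear(p: int) -> float:
    """Eq. (6.112)."""
    return LAMBDA_INF + (LAMBDA_1 - LAMBDA_INF) / p


def nonlinear_beta1(p: int) -> float:
    """Nonlinear approximation for beta = 1."""
    return LAMBDA_INF + A / p - B * math.log(1.0 + 1.0 / p)


for p in (1, 2, 10, 100, 1000):
    print(p, round(exact(p), 5), rule_of_thumb(p),
          round(linear(p), 5), round(nonlinear_beta1(p), 5))
```

Both the linear and the nonlinear forms reproduce Λ_1 exactly at p = 1 and tend to Λ_∞ for large p; the extra term mainly improves the intermediate orders.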

Table 6.6 gathers the equations for a number of customary distributions giving all
quantities that support the approximation of the entire series of Λ_p. Furthermore, Figure
6.11 illustrates the very satisfactory approximation achieved by the above method for all
distributions. A prominent characteristic of the lognormal distribution is the very slow
decay of the Λ-coefficient. Compared to the Pareto distribution (also shown in the
figure), the lognormal distribution has the limit Λ_∞ = 1.781 against 2.255 of the Pareto
case. However, even for moment order as high as 100 000, the former distribution retains
a Λ value much higher than the latter.

It is interesting that, according to equations (6.112) and (6.114), only two to four
numbers, namely the coefficients Λ_1 and Λ_∞ and occasionally β and B, can approximate
the complete series of Λ_p, practically for any distribution function. In turn, as the
Λ-coefficients are independent of the scale and location parameters (see below), by adding
the latter (in the form of u = K; and any of K2, Kz, or their noncentral variants) to the 
three Λ-coefficients we obtain an efficient parameterization of any distribution function.
This is justified by the fact that the Λ-coefficients, along with a location and scale
parameter, determine the complete series of K-moments K_p, while the latter are related
to the quantile function via the Laplace transform (section 6.8).
Figure 6.11 Illustration of the approximation of Λ-coefficients achieved by equation (6.114) for
the indicated distribution functions.

Table 6.6 Characteristic parameters for accurate approximations of Λ-coefficients and K-moments
for several customary distributions.


Distribution¹ and tail function, F̄(x)     Λ_1     Λ_∞     β     B
Normal, 
_ 2 
? exp (- Cw iF 2 ev 0 0.73 
i V21 0 . 
Exponential2, , 2A, — 3Ao _ 
en-x/A e e 1 244 —In2) = 0.152 
Gamma3, 
Pa) EA, ev 0 ~0.3 In(A, — 1.93) — 0.05 
r© re 
Weibull, : 0.02 
x5 (ra+p) ay ¢ 1.16 + 1.2(Ae — A) 
exp (- (5) ) ne — 0.003 
Lognormal, 
o (_ An(u/a))? 2 a 0.73 — 26 — 0.403 
ep (-Tze? ) ih erfe(o/23/2) “Ine +1 ) oe 
J V210Uu 
4 
tie, ; 7 2Ay +E -3)Ac 
xy\ = ie) es Kida) 2(1 —In2 
(1 ae @) (1-8 @—In2) 
1 0.02 
re, p(4-2,2) S\ 3 g Wag 
1 Feo is Ee 
Pe nie 74 Fer (fea Ss Se FGAe) + gage Stee @ ee Ay 
ESSE (5) $ —0.003 
Dagum,° A 
yee = - A 26Ay ~ (26 + § ~ 1A —F = 1 
(142) ‘ 1-((ea-ec+ oy" +1) nse ah) 
fav) 
EV25, 7 
ee ef 2(A1 — Ae) — 1 
oe — Fé Aa = malian 
1— exp (- (5) ‘) fee (-ca —é)) ‘) M$) 2 — In2) 
EVI’, ; F ; DONE Ne 1. 
1 — exp (-e77) 1/(1 — exp(-e’)) 4 2(1 — In2) 


1 For all distributions the domain is x ≥ 0, except for the normal and EV1, in which x ∈ R. Linear transformations
of x that change the lower bound have no effect on any of the characteristics given in the table.

2 The exponential distribution can also be approximated as a special case of the Gamma or the Weibull
distribution for ζ = 1, and even by the simple approximation of equation (6.112), but the specific approximation
with β = 1 is the best. Furthermore, it admits an exact equation, based on the relationships of Table 6.3, which is
Λ_p = e^{H_p}/p.

3 The linear approximation (equation (6.112)) is also good for the gamma distribution but only for ζ < 1.

4 The Pareto distribution can also be approximated as a special case of the PBF distribution for ζ = 1, and even
by the linear approximation (equation (6.112)). However, the specific approximation given here with β = 1 is the
best. Furthermore, it admits an exact equation, based on the relationships of Table 6.3, which is
Λ_p = ((p + 1 − ξ) B(1 − ξ, p + 1))^{1/ξ}/p.
5 The Dagum distribution also admits an exact equation, based on the relationships of Table 6.3, which is Λ_p =
-1/é -¢ 
a/(1-((pgBA- Eng +6) +1) )p. 
6 The EV2 distribution and all its expressions can be derived from the Dagum distribution for ζ → ∞; its exact
equation, based on the relationships of Table 6.3, is Λ_p = 1/((1 − exp(−(Γ(1 − ξ))^{−1/ξ}/p)) p).

7 The EV1 distribution and all its expressions can be derived from the EV2 distribution by substituting x/λ ←
x/λ + 1/ξ and letting ξ → 0; its exact equation, based on Table 6.3, is Λ_p = 1/((1 − exp(−e^{−γ}/p)) p).


Likewise, to evaluate, precisely or approximately, the tail moments for any distribution
we introduce the tail-based Λ-coefficient of order p as:

$$\overline{\Lambda}_p := \frac{1}{p\,F(\overline{K}'_p)} \qquad (6.119)$$

where Λ̄_p has properties similar to those of Λ_p and in particular varies only slightly with p. For
p = 1 it is readily seen that

$$\overline{\Lambda}_1 = 1/F(\mu) = \Lambda_1/(\Lambda_1 - 1) \qquad (6.120)$$

The limiting value Λ̄_∞ depends only on the lower-tail index ζ of the distribution:

$$\overline{\Lambda}_\infty = \Gamma(1 + 1/\zeta)^{-\zeta} \qquad (6.121)$$

and its own limits are:

$$\lim_{\zeta \to 0} \overline{\Lambda}_\infty = 0, \qquad \lim_{\zeta \to \infty} \overline{\Lambda}_\infty = e^{\gamma} \qquad (6.122)$$

For the normal distribution, which is symmetric, Λ̄_∞ = Λ_∞ = e^γ, while for the exponential
distribution Λ̄_∞ = 1.
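
The limiting behaviour stated in equations (6.121) and (6.122) is easy to check numerically. The sketch below evaluates the limiting tail coefficient as a function of the lower-tail index ζ and, for comparison, the exact tail coefficient of the exponential distribution. The latter assumes unit scale and uses the fact that the tail K-moment of order (p,1) is the expected minimum of p independent values, which is 1/p for the exponential distribution. Function names are illustrative.

```python
import math


def tail_coefficient_limit(zeta: float) -> float:
    """Limiting tail coefficient, eq. (6.121): Gamma(1 + 1/zeta) ** (-zeta).

    Computed through the log-gamma function to avoid overflow for small zeta.
    """
    return math.exp(-zeta * math.lgamma(1.0 + 1.0 / zeta))


def tail_coefficient_exponential(p: float) -> float:
    """Exact tail coefficient of eq. (6.119) for a unit-scale exponential variable.

    The expected minimum of p values is 1/p, hence F = 1 - exp(-1/p).
    """
    nonexceedance = -math.expm1(-1.0 / p)
    return 1.0 / (p * nonexceedance)


for zeta in (0.01, 0.5, 1.0, 2.0, 10.0, 1e6):
    print(zeta, round(tail_coefficient_limit(zeta), 4))

for p in (1, 10, 1000):
    print(p, round(tail_coefficient_exponential(p), 4))
```

For ζ = 1 the limit is exactly 1, the value quoted for the exponential distribution, and the exact exponential coefficient decreases from e/(e − 1) at p = 1 towards that limit.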

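As in the case of Λ-coefficients for noncentral moments, here again we give three
different approximations. Proceeding from the least to the most accurate, these are:

1. The rule-of-thumb approximation (zero-parameter):

$$\overline{\Lambda}_p \approx \begin{cases} 1, & \text{positively skewed distributions} \\ 2, & \text{negatively skewed or symmetric distributions} \end{cases} \qquad (6.123)$$

2. The linear approximation (two-parameter):

$$\overline{\Lambda}_p \approx \overline{\Lambda}_\infty + \frac{\overline{\Lambda}_1 - \overline{\Lambda}_\infty}{p}, \qquad \frac{\overline{T}(\overline{K}'_p)}{D} \approx \overline{\Lambda}_\infty\, p + (\overline{\Lambda}_1 - \overline{\Lambda}_\infty) \qquad (6.124)$$

3. The nonlinear approximation (four-parameter, with the parameters β and B
additional to Λ̄_1 and Λ̄_∞). Assuming β ≥ 0 we have:

$$\overline{\Lambda}_p \approx \overline{\Lambda}_\infty + \frac{A}{p} + B \ln\left(1 + \frac{\beta}{(p+1)^{\beta} - 1}\right), \qquad A = \overline{\Lambda}_1 - \overline{\Lambda}_\infty - B \ln\left(1 + \frac{\beta}{2^{\beta} - 1}\right) \qquad (6.125)$$

The most interesting cases are:

(a) β = 0, in which:

$$\overline{\Lambda}_p \approx \overline{\Lambda}_\infty + \frac{A}{p} + B \ln\left(1 + \frac{1}{\ln(p+1)}\right), \qquad A = \overline{\Lambda}_1 - \overline{\Lambda}_\infty - B \ln\left(1 + \frac{1}{\ln 2}\right) \qquad (6.126)$$

(b) β = 1, in which:

$$\overline{\Lambda}_p \approx \overline{\Lambda}_\infty + \frac{A}{p} + B \ln\left(1 + \frac{1}{p}\right), \qquad A = \overline{\Lambda}_1 - \overline{\Lambda}_\infty - B \ln 2 \qquad (6.127)$$
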
Table 6.7 gathers the equations for a number of customary distributions giving all
quantities that support the approximation of the entire series of the tail Λ-coefficients Λ̄_p.

Table 6.7 Characteristic parameters for accurate approximations of tail Λ-coefficients and tail
K-moments for several customary distributions.


Distribution¹ and tail function, F̄(x)     Λ̄_1     Λ̄_∞     β     B
Normal, 
_ 2 
“exp (- 2 ) 2 ev 0 0.73 
——__—d 
| Vato 
Exponential2, 2A, -—3 
= =x/a : 1 1 soe G7 
e e—1 2(1 — In2) 
Gamma3, 
Pa) hs ra¢i/2s 9 2.27-2In(2.2- Ay) - 24, 
T'(Z) rq) “% Ir(¢) 
. z - os 
Weibull, Acer) 2 2A, — 2A —1 
# ee ee (1 +1/Z)- ——_— 
Pere Oe fad Z 2(1 — In2) 
P ( (5) ) (P+) oe 
Lognormal, 
Bs (- coq) i oY 0 0.73 + 2.40 — 0.703/2 
20 di 2 — erfc(a/23/2) 
J V210u 
Pareto‘, _ 
. 1 2A, +&-3 
xy\\ § bs . : 2(1 —1In2) 
(+6) 1-(1-£) 
1 
PBF, =i ee 
et a(d 1 ty S\ GE 1 2A, + A(G& +F& —2) -CE-1 
x c¢ FETT -¢ = 
(1+26()) el fees fae ra +1/¢) 2(1 —In2) 


1 For all distributions it is assumed that x ≥ 0, except for the normal, in which x ∈ R. The exponential, Weibull,
Pareto and PBF distributions admit exact equations, based on the relationships of Table 6.2.

2 The equations for the exponential distribution are identical to those of Weibull for ζ = 1 and can also be
approximated as a special case of the Gamma distribution for ζ = 1, and even by the linear approximation
(equation (6.124)), but the specific approximation with β = 1 is the best. Furthermore, its exact equation, based
on the relationships of Table 6.3, is Λ̄_p = 1/((1 − e^{−1/p}) p).

3 The linear approximation (equation (6.124)) is also good for the gamma distribution but only for ζ < 1.

4 The Pareto distribution can also be derived as a special case of the PBF distribution for ζ = 1.

5 The approximation given for PBF setting β = 1 is good for ζξ < 1 and needs further study for ζξ > 1 (β > 1).
We stress that the equations of both Table 6.6 and Table 6.7 refer to continuous
distributions. In the case of mixed distributions with a discontinuity Pp) = 1— P, we 
should bear in mind that Λ_∞ retains its value as in the continuous case, while Λ̄_∞ = 0.
In practical applications, it is advisable to work with the continuous part of the 
distribution and treat the discontinuity separately. 

Recapitulating the above discourse, the A-coefficients have the following important 
properties: 


• They vary in a narrow range (close to 2 for noncentral K-moments or close to 1 for
tail K-moments of positively skewed distributions) and this facilitates the
determination of the complete series by only a couple of them (namely, Λ_1 and Λ_∞)
and, if higher accuracy is required, by the additional parameters β and B (and
likewise for the tail K-moments).

• They are well approximated by generic functions, irrespective of the particular
distribution function.

• Their definition in terms of return period renders them suitable to study extremes.

• Also, their definition, in connection to their generic approximators, supports the
indirect but quick determination of theoretical (true) values of K-moments of any
order in the absence of analytical relationships.


The last point suggests that we could follow a similar approach for the calculation of
K-moments for higher q. Specifically, we can generalize (6.109) and define Λ-coefficients of
orders (p, q) as:

$$\Lambda'_{pq} := \frac{1}{(p - q + 1)\left(1 - F\left({K'_{pq}}^{1/q}\right)\right)} \qquad (6.128)$$

for noncentral moments and

$$\Lambda_{pq} := \frac{1}{(p - q + 1)\left(1 - F\left(K_{pq}^{1/q} + \mu\right)\right)} \qquad (6.129)$$

for central ones (and similarly for tail K-moments). We readily observe that for q = 1 we
recover (6.109) and specifically:

$$\Lambda'_{p1} = \Lambda_{p1} = \Lambda_p \qquad (6.130)$$

as K'_{p1} = K_{p1} + μ. However, for q > 1, Λ'_{pq} ≠ Λ_{pq} except in the limit as p → ∞. Like the
Λ-coefficients of q = 1, Λ_{pq} will also vary in a narrow range. In particular, as the tail index of
the distribution of x^q will be qξ, the limit Λ'_{∞q} = Λ_{∞q} can be readily determined. Thus,
with reference to the distributions in Table 6.6 and for q = 2, those that have tail index
zero and Λ_∞ = e^γ will also have Λ'_{∞2} = Λ_{∞2} = e^γ = 1.781. Furthermore, those that have
Λ_∞ = Γ(1 − ξ)^{1/ξ} will have Λ'_{∞2} = Λ_{∞2} = (Γ(1 − 2ξ))^{1/(2ξ)} and more generally, for q < 1/ξ,
Λ'_{∞q} = Λ_{∞q} = Γ(1 − qξ)^{1/(qξ)}.
Now if we consider a linear transformation of the variable x, i.e. w = bx + c, then the
following equations, whose proof is contained in Appendix 6-IX, hold:

$$K_{pq}[w] = b^{q} K_{pq}[x], \qquad \Lambda_{pq}[w] = \Lambda_{pq}[x] \qquad (6.131)$$

In other words, the central Λ-coefficients are invariant under linear transformations of
the variables. This enables the theoretical calculation of Λ_{pq} by linearly transforming the
original variable x to one with a simpler expression of distribution (e.g., with lower bound
or mean zero and unit scale parameter) and performing the calculations on w. Then K_{pq}
can be determined from (6.131) through the central K-moments of the simpler
distribution of w. The noncentral K-moments can be determined from the central
K-moments; the relationships are contained in Appendix 6-II.


Digression 6.C: The behaviour of the normal distribution 


Here we will find approximations of the K-moments and the Λ-coefficients of the normal
distribution. Using the approximation of the normal distribution by (5.44) and its quantile
function by (5.45), we find in Appendix 6-X that:


p 
Payne p _4)/k m= 2 
Kp = By +) (f)CD'B, Ay = A (6.132) 
k=1 p exp (-3% (1 +345)) 
where 


1/2 


B, =p | x(F) F?-1 dF 
0 


3VTe?/* erfc 2 
oe Vi (Jp/ ) (6.133) 
2pt2 
These approximations, depicted in Figure 6.12 in comparison to the exact values, are rather
satisfactory. However, evaluation of (6.132) for p > 100 is problematic because of the
numerical instability of the binomial transform contained in it.

Figure 6.12 Comparison of approximate and exact values of K-moments and Λ-coefficients of order q = 1
for the normal distribution. Approximations 1 and 2 are calculated by equations (6.132) and (6.134),
respectively, and the exact values are calculated by numerical integration.


A better approximation can be obtained in a simpler manner from equation (6.115) with
coefficients from Table 6.6 (Λ_1 = 2, Λ_∞ = e^γ, B = 0.73). This yields:

$$\Lambda_p \approx e^{\gamma} + \frac{A}{p} - 0.73 \ln\left(1 + \frac{1}{\ln(p + 1)}\right), \qquad A = 2 - e^{\gamma} + 0.73 \ln\left(1 + \frac{1}{\ln 2}\right) \approx 0.871 \qquad (6.134)$$

As seen in Figure 6.12 for p up to 100 and in Figure 6.11 for much higher p, this latter
approximation is very accurate and thus preferable over that of equation (6.132).
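
The accuracy of the simpler approximation can be checked with a few lines of code. The sketch below evaluates it in the form Λ_p ≈ e^γ + A/p − 0.73 ln(1 + 1/ln(p + 1)), with A chosen so that Λ_1 = 2 is recovered exactly, and compares it with values obtained by numerical integration of K'_p = ∫ x p F(x)^{p−1} f(x) dx for the standard normal distribution. The trapezoidal rule, the integration limits and the function names are choices made here for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329
LAMBDA_INF = math.exp(EULER_GAMMA)
B = 0.73  # value listed for the normal distribution in Table 6.6 (beta = 0)
A = 2.0 - LAMBDA_INF + B * math.log(1.0 + 1.0 / math.log(2.0))


def approx_coefficient(p: float) -> float:
    """Approximation of eq. (6.134) for the normal distribution."""
    return LAMBDA_INF + A / p - B * math.log(1.0 + 1.0 / math.log(p + 1.0))


def normal_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def k_moment(p: int, lo: float = -10.0, hi: float = 10.0, steps: int = 20000) -> float:
    """Noncentral K-moment of order (p, 1) by the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * x * p * normal_cdf(x) ** (p - 1) * normal_pdf(x)
    return total * h


def exact_coefficient(p: int) -> float:
    """Eq. (6.109) with the exceedance probability from the complementary error function."""
    exceedance = 0.5 * math.erfc(k_moment(p) / math.sqrt(2.0))
    return 1.0 / (p * exceedance)


for p in (1, 10, 100):
    print(p, round(exact_coefficient(p), 4), round(approx_coefficient(p), 4))
```

Note that for the normal distribution the coefficient drops below Λ_∞ at moderate orders and approaches the limit from below, which is why a form that is not monotonically decreasing in p is needed.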


Digression 6.D: Do return periods assigned by order statistics and 
K-moments differ? 


Both frameworks of order statistics and K-moments provide means to assign empirical return
periods from an observed sample. In order statistics we use equation:

$$\frac{T_{(i:n)}}{D} = \frac{n + B}{n - i + A} \qquad (5.57)$$

for specified A and B, while for the K-moment of order (p,1) we use equation:

$$G(p) := \frac{T(K'_p)}{D} = p\,\Lambda_p \qquad (6.109)$$

where specific forms of Λ_p and hence of the function G(p) have been extensively discussed in
section 6.14. Here we use the linear approximation:

$$G(p) = \Lambda_\infty\, p + (\Lambda_1 - \Lambda_\infty) \qquad (6.135)$$

The two approaches for assigning return periods are comparable at certain return periods,
namely those corresponding to integer i between 1 and n in equation (5.57), for which the return
period T_(i:n) is defined. Given T_(i:n) there is a specific p such that T(K'_p) = T_(i:n) = T. This is given as
p = G^{-1}(T_(i:n)/D).

Thus, to each of the n values T_(i:n) in a sample we can assign values x in two different ways:

1. x_(i:n), i.e., the ith smallest of the n observations;
2. K̂'_p, where p = G^{-1}(T_(i:n)/D) and K̂'_p is estimated from the entire sample by equations
(6.66)-(6.68).

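These two values should be close to each other in general. We can make them precisely equal
if we use the approximation (6.135) for the latter along with the values of A and B of the
former corresponding to the unbiased quantile (equation (5.71)). Equating the two we have:

$$\Lambda_\infty\, p + (\Lambda_1 - \Lambda_\infty) = \frac{n + \Lambda_1/\Lambda_\infty - 1}{n - i + 1/\Lambda_\infty} \qquad (6.136)$$

and hence:

$$p = 1 + \frac{i - 1 - (\Lambda_1 - 1)(n - i)}{\Lambda_\infty (n - i) + 1}, \qquad i = n - \frac{n - p}{\Lambda_1 + \Lambda_\infty (p - 1)} \qquad (6.137)$$

Note that for p = 1, i = n − (n − 1)/Λ_1 (e.g. i ≈ n/2 if Λ_1 = 2) and for p = n, i = n.

This cannot work for p < 1 or for i < n − (n − 1)/Λ_1. For small i we can continue with tail
moments of order p ≥ 1, assuming return period of nonexceedance of K̄'_p equal to T̄/D = Λ̄_∞ p +
(Λ̄_1 − Λ̄_∞), with T/D = 1/(1 − D/T̄). Thus:

$$\frac{1}{1 - 1/\left(\overline{\Lambda}_\infty\, p + (\overline{\Lambda}_1 - \overline{\Lambda}_\infty)\right)} = \frac{n + \Lambda_1/\Lambda_\infty - 1}{n - i + 1/\Lambda_\infty} \qquad (6.138)$$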


and taking also (6.120) into account and setting Λ̄_∞ = Λ_∞ (1 + δ)/(Λ_1 − 1) (for some δ which, as
we will see, turns out to be very small) we find:

$$p = 1 + \frac{(\Lambda_1 - 1)(n - i) - i + 1}{(1 + \delta)\left(\Lambda_\infty (i - 1) + \Lambda_1 - 1\right)}, \qquad i = 1 + \frac{(\Lambda_1 - 1)\left(n - 1 - (1 + \delta)(p - 1)\right)}{\Lambda_1 + \Lambda_\infty (1 + \delta)(p - 1)} \qquad (6.139)$$

For a symmetric distribution, Λ_1 = 2 and Λ̄_∞ = Λ_∞, and thus δ = 0. For the skewed distributions
contained in Table 6.7, by comparing with Table 6.6 and investigating numerically, it can be
verified that δ is small, of the order of 2 % - 5 %. Thus, neglecting δ we find:

$$p = 1 + \frac{(\Lambda_1 - 1)(n - i) - i + 1}{\Lambda_\infty (i - 1) + \Lambda_1 - 1}, \qquad i = 1 + \frac{(n - p)(\Lambda_1 - 1)}{\Lambda_1 + \Lambda_\infty (p - 1)} \qquad (6.140)$$

Based on the above results, for a given moment order p a quick-and-dirty estimate of the
noncentral moment K'_p is the value x_(i:n), the ith smallest value of the sample, with i determined
from equation (6.137). Likewise, if i is determined from equation (6.140) for a given p, then the
value x_(i:n) is a quick-and-dirty estimate of the tail K-moment K̄'_p. In essence, the quick-and-dirty
K-moments approach is equivalent to the order statistics one.
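
The correspondence between moment order p and rank i can be sketched in code. The functions below assume that the relationships follow from equating the linear approximation G(p) = Λ_∞ p + (Λ_1 − Λ_∞) with the plotting position (n + Λ_1/Λ_∞ − 1)/(n − i + 1/Λ_∞), and that δ is neglected in the tail case. The function names are illustrative, and the ranks returned are generally non-integer, so in practice they would be rounded or interpolated.

```python
def order_from_rank(i: float, n: int, lam1: float, lam_inf: float) -> float:
    """Noncentral moment order p whose return period matches that of rank i."""
    return 1.0 + (i - 1.0 - (lam1 - 1.0) * (n - i)) / (lam_inf * (n - i) + 1.0)


def rank_from_order(p: float, n: int, lam1: float, lam_inf: float) -> float:
    """Rank i (counted from the smallest value) matched to noncentral order p."""
    return n - (n - p) / (lam1 + lam_inf * (p - 1.0))


def tail_rank_from_order(p: float, n: int, lam1: float, lam_inf: float) -> float:
    """Rank i matched to the tail K-moment of order p, with delta neglected."""
    return 1.0 + (n - p) * (lam1 - 1.0) / (lam1 + lam_inf * (p - 1.0))


# Example with the coefficients of a Pareto distribution with tail index 0.15.
n, lam1, lam_inf = 100, 2.955, 2.035
for p in (1, 2, 5, 20, 100):
    print(p,
          round(rank_from_order(p, n, lam1, lam_inf), 2),
          round(tail_rank_from_order(p, n, lam1, lam_inf), 2))
```

For p = n the noncentral case returns the largest observation (i = n) and the tail case the smallest (i = 1), while for p = 1 both return the same rank, since both moments then equal the mean.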

For numerical illustration we use the Pareto distribution with tail index ξ = 0.15 and other
parameters as shown in the caption of Figure 6.13. In this case T_(i:n) will be determined from the
unbiased estimator of the quantile of the Pareto distribution (corresponding to unbiased quantile,
case VI of Table 5.5) with ξ = 0.15:

$$\frac{T_{(i:n)}}{D} = \frac{n + 0.452}{n - i + 0.491}$$

For the noncentral K-moment of order (p, 1) and for p ≥ 1 we use the linear approximation

$$G(p) = \Lambda_\infty\, p + (\Lambda_1 - \Lambda_\infty) = 2.035p + 0.92$$

Thus,

$$p = \frac{T/D - 0.92}{2.035} = 0.491\, T/D - 0.452$$

This is precisely the result we get from the quantile-unbiased estimator of the order statistics if
we set n = i = p. For return periods T/D < Λ_1 = 2.95 the resulting p is smaller than 1. This is not
a problem as the definition of K-moments has already been extended for non-integer order p,
except that the linear approximation is no longer accurate.

Therefore, we use the tail K-moment of order (p,1); for p ≥ 1 we apply the linear
approximation:

$$\overline{G}(p) = \overline{\Lambda}_\infty\, p + (\overline{\Lambda}_1 - \overline{\Lambda}_\infty) = p + 0.512$$

Thus,

$$p = \overline{T}/D - 0.512 = 1/(1 - D/T) - 0.512$$

Note that for p = n, 1/(1 − D/T) = p + 0.512 or, after the algebraic manipulations, T/D =
(n + 0.512)/(n − 0.488). This is slightly different from (n + 0.452)/(n − 0.509), as given by the
order statistics approach. For n = 100, the former formula gives T/D = 1.0100 and the latter
1.00966, a difference of 0.04 %. This slight difference is due to the fact that δ is not exactly zero
(namely, δ = (Λ_∞/Λ̄_∞)(Λ̄_1 − 1) − 1 = (2.035/1) × 0.512 − 1 = 0.042).

Alternatively, as the Pareto distribution yields an exact solution (see Table 6.2), we could use
that instead of the approximations. Namely, this is:

$$G(p) = \left((p + 1 - \xi)\, B(1 - \xi,\, p + 1)\right)^{1/\xi}$$

where B(·, ·) denotes the beta function, and is valid for all p, integer or not, including for p < 1.
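
The numbers used in this illustration can be reproduced with a short script. The sketch below assumes the exact Pareto relationship in the form G(p) = ((p + 1 − ξ) B(1 − ξ, p + 1))^{1/ξ}, with B the beta function, together with Λ_1 = (1 − ξ)^{−1/ξ} and Λ_∞ = Γ(1 − ξ)^{1/ξ}, and compares it with the linear approximation. Names are illustrative.

```python
import math

XI = 0.15  # tail index of the Pareto distribution used in the illustration


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def g_exact(p: float, xi: float = XI) -> float:
    """Exact T(K'_p)/D for the Pareto distribution (assumed form, see lead-in)."""
    return math.exp((math.log(p + 1.0 - xi) + log_beta(1.0 - xi, p + 1.0)) / xi)


LAMBDA_1 = (1.0 - XI) ** (-1.0 / XI)            # about 2.955
LAMBDA_INF = math.gamma(1.0 - XI) ** (1.0 / XI)  # about 2.035


def g_linear(p: float) -> float:
    """Linear approximation, eq. (6.135)."""
    return LAMBDA_INF * p + (LAMBDA_1 - LAMBDA_INF)


print(round(LAMBDA_1, 3), round(LAMBDA_INF, 3), round(LAMBDA_1 - LAMBDA_INF, 2))
for p in (0.5, 1, 2, 10, 100):
    print(p, round(g_exact(p), 3), round(g_linear(p), 3))
```

Unlike the linear approximation, the exact expression remains meaningful for p < 1 and tends to T/D = 1 as p tends to zero.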

Simulation results for our example are shown in Figure 6.13. On the average, the
order-statistics and the K-moment approaches give equivalent results and equally good, both in terms
of averages and uncertainty limits (prediction limits). However, if we focus on a single realization,
such as the one also shown in Figure 6.13, the K-moments approach yields a smooth arrangement
of empirical points, while that of the order-statistics approach indicates a greater variability and
a rougher arrangement. The reason is that in the K-moments approach each K-moment value is a
weighted average of several points, while in the order statistics only one value is used each time.
As regards the order-statistics approach, we note that there would be a substantial difference in
the largest value if we adopted the Weibull plotting position formula, which, as explained in
section 5.6, we have deemed inappropriate.

Figure 6.13 Simulation results of empirical return periods assigned to Pareto quantiles (for tail index
ξ = 0.15, scale parameter λ = 1 and lower bound zero). Averages and prediction limits (PL) were calculated
from 200 simulations each with n = 100. The curves of averages for both the order statistics and the
K-moment approaches are indistinguishable from the theoretical curves. The return periods T_(i:n) were
assigned by (left) the generic option of unbiased ln T and (right) the unbiased quantile option. The
correspondence between the K-moment of order p and the return period T is also shown through the upper
horizontal axis. The plots of a single realization are also shown (but for part of the empirical points to avoid
an overcrowded graph).


The above illustration helps us to formulate a few focus points of more general validity: 


1. The largest value in an observed sample of size n does not have a return period of about n time
units as commonly assumed. Rather, it is the order p of the maximum K-moment that is equal
to n. Thus, the return period of the maximum observation is about Λ_∞ n, usually 1.8n to 2n.

2. In both approaches, the order-statistics and K-moments, the results are virtually (or even 
exactly) the same in terms of expected values and uncertainty. 

3. While with order statistics we can empirically assign return periods only to the observations, 
thus designating only n specific values of return period, with K-moments there is no such 
restriction. Rather, we can empirically assign a return period to any quantile value between 
the smallest and the largest observation. 

4. Because of the more accurate formulae for K-moments discussed in section 6.14, in 
comparison to those of order statistics discussed in section 5.6, the return periods empirically 
assigned by the K-moments approach are typically more accurate or at the very least 
equivalent to those of order statistics. 

5. In addition, while with order statistics only one observation is used for each assignment of 
return period, in the K-moments approach each K-moment value is a weighted average of 
several observations, thus contributing to the accuracy of estimation. 


For all these reasons the K-moments approach is deemed preferable. We note though that 
computationally it is more demanding. 
6.15 Extreme-oriented estimation via K-moments


As we have discussed in section 2.19, using the entire data set to model extremes is
preferable to using block (e.g. annual) maxima. Usually, the values above a specified
threshold are chosen, while the remaining values are discarded in modelling. However, if
we use the K-moments there is no need to set a threshold. All values can be used, but as
we have seen in section 6.9, in a sample of size n the estimation of the K-moment of orders
(p, 1) relies only on the n − p + 1 largest values, thus rendering thresholding unnecessary.

In Chapter 4 we have discussed two different approaches for fitting distribution 
functions to data. The method of maximum likelihood is well reasoned and is based on an 
optimization logic. In contrast, the method of moments is based on solving equations and 
is not quite rigorously argued. Assuming that we fit a two-parameter model (say, a two- 
parameter gamma distribution) the method of moments uses the first two classical 
(noncentral) moments and determines the two parameters by equating the sample 
moments to the theoretical moments of the distributions. One could raise two major 
questions on the logic of this method: 


1. Why use the first and second moments and not, say, the second and third? One may 
easily justify the standard choice of using the lowest possible order of moments by 
the fact that higher moments are less accurately estimated. On the other hand, one 
may counter that, when we are interested in extremes, these are better reflected in 
higher-order moments. It is well known that a model can hardly be a perfect 
representation of reality. Thus, we cannot expect that a good model fitting on the 
first and second moment would be equally suitable for the distribution tail, i.e. the 
behaviour on extremes. 

2. Why use two moments and not more? The standard answer, that two equations 
suffice to find two unknowns, may be adequate from a theoretical mathematical 
point of view but it is not from an empirical and engineering one. (As the saying 
goes, to draw a straight line a mathematician needs two points but an engineer 
needs three). Certainly, an optimization framework (as in maximizing likelihood or 
in minimizing fitting error) is much preferable and superior to an equation solving 
method. 


Having introduced the concept of K-moments we have already seen several 
advantages, which are particularly strong for an extreme-oriented modelling. The 
following three properties are highlighted: 


1. They are knowable with unbiased estimators (from samples) for high orders, up 
to the sample size n, while the estimation uncertainty is by orders of magnitude 
lower than in the classical moments (section 6.9 and Digression 6.A). 

2. The estimators can explicitly (albeit approximately) take into account any existing 
dependence structure (section 6.12). 

3. The K-moment values can directly be assigned return periods, through
Λ-coefficients, similar to what happens with order statistics, but with some
advantages over the latter (section 6.14).

With the above points in mind, we can now formulate our extreme-oriented
distribution fitting based on these postulates.


1. We use all n data. 

2. From the data we estimate all K-moments, from orders 1 to n (alternatively, we 
could choose a subset of them, e.g. 100 values of p arranged in a geometric 
progression from 1 to n). 

3. We make a climacogram of data and assess if there is long range dependence. In 
the case there is, we adapt the moment orders using equation (6.96). 

4. We assume a model, i.e., a marginal distribution function with some parameters,
and for that model we establish the relationship between moment order p and
Λ-coefficients Λ_p, which define return periods T/D = pΛ_p.

5. We choose a range of return periods depending on our focus on extremes and over 
that range we fit the model so as to minimize the mean square errors (possibly 
weighted) between the empirical and theoretical K-T relationships. 


In all points the moment order q is assumed to be 1. 
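
The steps above can be sketched in code for the simplest possible model, the one-parameter exponential distribution, ignoring the adaptation for dependence (step 3). The estimator used here is an assumption of this sketch: it takes the noncentral K-moment estimate as a weighted sum of the order statistics, with weights b_i = (p/n) Γ(n − p + 1)/Γ(n) × Γ(i)/Γ(i − p + 1) for i ≥ p and zero otherwise, which is the standard unbiased form and should be checked against equations (6.66)-(6.68). For the exponential model T/D = exp(H_p), and the model quantile at that return period is the scale times H_p, so the least-squares fit has a closed form. All function names are hypothetical.

```python
import math


def k_moment_estimate(sample, p):
    """Estimate of the noncentral K-moment of order (p, 1), 1 <= p <= n.

    Assumed estimator (see lead-in): weighted sum of order statistics, with
    zero weight on the p - 1 smallest values.
    """
    x = sorted(sample)
    n = len(x)
    log_c = math.log(p / n) + math.lgamma(n - p + 1) - math.lgamma(n)
    total = 0.0
    for i in range(max(1, math.ceil(p)), n + 1):
        log_b = log_c + math.lgamma(i) - math.lgamma(i - p + 1)
        total += math.exp(log_b) * x[i - 1]
    return total


def geometric_orders(n, count=30):
    """Integer orders from 1 to n, roughly in geometric progression."""
    if n < 2:
        return [1]
    return sorted({round(n ** (k / (count - 1))) for k in range(count)})


def harmonic(p):
    return sum(1.0 / k for k in range(1, p + 1))


def fit_exponential_scale(sample, orders=None):
    """Unweighted least-squares scale on the K-T relationship (exponential model)."""
    orders = orders or geometric_orders(len(sample))
    num = den = 0.0
    for p in orders:
        h = harmonic(p)  # model quantile at T/D = exp(H_p), per unit scale
        num += h * k_moment_estimate(sample, p)
        den += h * h
    return num / den


# Usage on a synthetic sample: exponential quantiles with scale 7.2.
sample = [-7.2 * math.log(1.0 - (i - 0.5) / 2000) for i in range(1, 2001)]
print(round(fit_exponential_scale(sample), 3))
```

Restricting `orders` to high values of p shifts the focus of the fit towards the extremes, which corresponds to choosing the range of return periods in step 5.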

A possible criticism on using high order K-moments (up to order n) is that this gives 
higher weight to the highest observations, which are more uncertain than the low ones. 
This criticism would be valid if the true distribution function was known to be the one 
chosen as a model for the real-world process studied. But this is hardly the case. Let us 
assume that in the time series of flow observations we have three very high values and 
that we have chosen a certain model, e.g. a Lognormal distribution. How can we be sure 
that the model is correct? If we are not sure (which actually is always the case), and if we 
are to design a certain engineering construction, would we prefer a fitting of the chosen 
model that is consistent with theoretical considerations, e.g. based on the maximum 
likelihood method, even if this yields a departure for the three high values? Or would we 
feel safer if our fitting represents well the three high values? 

The framework is illustrated in Digression 6.E for rainfall extremes in Bologna. 


Digression 6.E: Extreme-oriented estimation of rainfall in Bologna 


The record of daily rainfall in Bologna has already been discussed in section 1.3 and Digression
2.H. The period of observation is T_obs = 206 years and it includes n = 19 426 nonzero rainfall
depths (all other daily rainfall values are zero). Therefore, the time reference for defining the
distribution function (conditional on x > 0) and return period is:

D = T_obs/n = 206 years/19 426 = 0.01060 years = (1/94.3) years

(which means 94.3 rain days per year). As already discussed in Digression 2.H, the average daily
rainfall during rain days is 7.2 mm and the maximum 155.7 mm.

Here we will see several options for fitting a marginal distribution to nonzero daily rainfall, we
will study the differences among them, and we will trace a nearly optimal option. Firstly, for the
sake of illustration we intentionally choose the simplest and blatantly unsuitable model, the
one-parameter exponential distribution:

$$F(x) = 1 - e^{-x/\lambda}$$

In this case, one moment suffices to estimate the single (scale) parameter λ, but which moment
to choose? The standard option is to choose the first moment, the mean, so that λ = μ = 7.2 mm.
This would be the same as choosing classical moments, L-moments, etc. The maximum likelihood
method would also result in the same estimate of λ.

What if we chose a moment higher than 1 for the estimation? The exponential distribution is
quite convenient and yields simple analytical relationships for all types of moments. Thus, the
theoretical K- and classical moments are:

$$K_{p1} = (H_p - 1)\,\lambda, \qquad K_{p2} = \left((H_{p-1} - 1)^2 + H^{(2)}_{p-1}\right)\lambda^2, \qquad K_{pp} = \mu_p = (!p)\,\lambda^p$$

where H_p is the pth harmonic number, H^{(2)}_p is the pth harmonic number of order 2 and !p is the
subfactorial of p. If we estimate the sample moment K̂_{p1}, K̂_{p2} or μ̂_p and equate it to the respective
theoretical quantity as above, we obtain another estimate of λ. The resulting estimates are plotted
in Figure 6.14 (left) vs. moment order p, whilst some of the resulting fitted distributions are
plotted in Figure 6.14 (right) in comparison to the empirical distribution.
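
The theoretical moments of the exponential distribution quoted above are simple to evaluate. The sketch below assumes the forms K_{p1} = (H_p − 1)λ, K_{p2} = ((H_{p−1} − 1)² + H^{(2)}_{p−1})λ² and μ_p = (!p)λ^p, and shows how a scale estimate would follow from a sample central K-moment of order (p, 1). Function names are illustrative.

```python
def harmonic(p: int, order: int = 1) -> float:
    """Generalized harmonic number: sum of 1/k**order for k = 1..p."""
    return sum(1.0 / k ** order for k in range(1, p + 1))


def subfactorial(p: int) -> int:
    """Subfactorial !p, via the recurrence !p = p * !(p-1) + (-1)**p with !0 = 1."""
    value = 1
    for k in range(1, p + 1):
        value = k * value + (-1) ** k
    return value


def central_k_moment_q1(p: int, scale: float) -> float:
    """Central K-moment of orders (p, 1) of the exponential distribution."""
    return (harmonic(p) - 1.0) * scale


def central_k_moment_q2(p: int, scale: float) -> float:
    """Central K-moment of orders (p, 2) of the exponential distribution."""
    return ((harmonic(p - 1) - 1.0) ** 2 + harmonic(p - 1, 2)) * scale ** 2


def central_classical_moment(p: int, scale: float) -> float:
    """Classical central moment of order p of the exponential distribution."""
    return subfactorial(p) * scale ** p


def scale_from_k_moment_q1(k_hat: float, p: int) -> float:
    """Scale estimate from a sample value of K_p1; requires p >= 2."""
    return k_hat / (harmonic(p) - 1.0)


for p in (2, 10, 100, 1000):
    print(p, round(central_k_moment_q1(p, 7.2), 2), round(central_k_moment_q2(p, 7.2), 1))
```

For p = 2 both K_{22} and μ_2 reduce to the variance λ², which provides a quick consistency check of the expressions.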

It is evident in Figure 6.14 that the moment order p affects the fitting dramatically. Specifically,
the scale parameter λ increases with increasing p and q. If we wish to model maxima, it is better
to fit based on the thousandth K-moment than on the first! This clearly shows that, as far as
extremes are concerned, high-order K-moments are preferable to low-order K-moments.

In a next step, we fit and compare both the exponential and the Pareto distribution in two
cases: for the entire data set (size: 19426 for 206 years) and for values over threshold (VOT),
where the threshold (47 mm) was chosen so that the sample contains 206 values (size equal to the
number of years). Specifically, the two distribution functions are, respectively,

$$F(x) = 1 - e^{-(x/\lambda - \psi)}, \qquad F(x) = 1 - \left(1 + \xi\,(x/\lambda - \psi)\right)^{-1/\xi}$$

Comparisons of empirical and theoretical distributions are depicted in Figure 6.15. The
exponential distribution was fitted with one parameter (setting $\psi = 0$) for all data and with two
parameters for the VOT case. The Pareto distribution was fitted with two parameters (setting
$\psi = 0$) for all data and three for the VOT case. Obviously, physical consistency demands that
$\psi = 0$ but violation of this condition can improve the fitting.
Figure 6.14 (left) Estimate of the scale parameter $\lambda$ of the exponential distribution from the pth moment,
fitted on the Bologna rainfall record; the circle corresponds to the standard estimate by any of the methods
of classical moments, L-moments, K-moments and maximum likelihood. (right) Resulting fitted
distribution, as a graph of x vs. T, for the indicated values of p; the empirical distribution is calculated by
formula III of Table 5.5.


For the Pareto case, the methods of moments and L-moments were used, with the lowest 
orders (1, 2 or 3, depending on the number of parameters of the theoretical distribution). Figure 
6.15 shows that there is not a clear winner between the moments and L-moments methods. When 
the entire data set is used, the fitting is quite unsatisfactory for the distribution tail (extremes). 
Yet the classical moments fitting shows better performance than the L-moments. If we use the
sample over threshold and the three-parameter Pareto, classical moments and L-moments give
fittings very close to each other (with a slight advantage of the latter on both small and high values).
Of the two options, all data and VOT, the latter gives a better fitting on the maxima,
but at the expense of an additional parameter and a physically inconsistent nonzero minimum.
Figure 6.15 Fitting (in terms of plot of quantile x vs. return period T) of the exponential (left) and Pareto
(right) distribution on the Bologna daily rainfall record by the indicated methods; the empirical 
distribution is calculated by formula III of Table 5.5. 


Let us now examine two questions: Can we improve the first option, so that the lower bound 
be zero for physical consistency? Can we use the entire data set and fit on the distribution tail? 
The answer to both questions is positive and in fact the first question has already been discussed 
in Digression 2.H. Here we will study them exclusively using the K-moments, both for assigning 
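empirical return periods and for distribution fitting. We assume a Pareto distribution with zero
lower bound:

$$F(x) = 1 - \left(1 + \xi\frac{x}{\lambda}\right)^{-1/\xi}, \qquad \frac{T(x)}{D} = \left(1 + \xi\frac{x}{\lambda}\right)^{1/\xi}, \qquad x = \lambda\,\frac{(T/D)^{\xi} - 1}{\xi}$$
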
The estimated K-moments have return period: 


T(R; 1 

Ao) pA, = (GAS oBd— <p): 

(With negligible error, we could also use the approximation $T(K_p)/D = \Lambda_\infty (p - 1) + \Lambda_1$.) We
estimate the parameters $\xi$ and $\lambda$ by minimizing the mean square error of the logarithms of the
empirical $T(\hat{K}_p)$ from the theoretical $T(K_p)$. (Minimizing the error of $\hat{K}_p$ with respect to $K_p$,
without reference to T, is another possibility.) We calculate the error for a range of T from 2 years
to the maximum value that the sample size allows. The fitted parameters are shown in Table 6.8.
The fitted distribution function is depicted in Figure 6.16, which shows a perfect agreement of
theoretical and empirical curves for T > 1 year (the two curves are indistinguishable). For
comparison, empirical curves for the order statistics are also plotted but these have not been used
at any step of the fitting procedure.

The model shown in Figure 6.16 (right) is quite satisfactory, almost perfect, as far as the 
distribution tail is concerned. The proximity of empirical and theoretical curves is remarkable, as 
is the physical consistency and parsimony of the model, which contains only a scale parameter 
and a tail index. 

By changing the parameters, we can obtain a better fit on the entire set of values but at the cost
of worsening that on large return periods (this has been already done in Figure 2.4). Alternatively,
by adding one parameter to the theoretical distribution function, we can obtain a model
applicable for the entire range of rainfall depth, without compromising the performance on large
return periods. Namely, we use the Pareto-Burr-Feller (PBF) distribution, again with zero lower
bound:


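$$F(x) = 1 - \left(1 + \zeta\xi\left(\frac{x}{\lambda}\right)^{\zeta}\right)^{-\frac{1}{\xi\zeta}}, \qquad \frac{T(x)}{D} = \left(1 + \zeta\xi\left(\frac{x}{\lambda}\right)^{\zeta}\right)^{\frac{1}{\xi\zeta}}, \qquad x = \lambda\left(\frac{(T/D)^{\xi\zeta} - 1}{\xi\zeta}\right)^{\frac{1}{\zeta}}$$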

Now for the fitting we use the same estimation procedure as above but calculate the error on
the entire range of values. However, we give less importance to the low quantiles by weighting
the square error at each point with the quantile itself. In this case we have two different fitting
variants. In the first we do not apply any constraint on the parameters and in the second we keep the
tail index $\xi$ as estimated for the Pareto distribution ($\xi = 0.098$, in order not to distort the good
fitting on the tail). The parameters are shown in Table 6.8. A perfect fit of the model and the
empirical curve for the entire range of return periods is seen in Figure 6.17 (referring to the first
variant).
Figure 6.16 Fitting (in terms of plot of quantile x vs. return period T) of the Pareto distribution on the
Bologna daily rainfall record by the indicated methods (left) assuming independence and (right)
accounting for long-range dependence; the curves of theoretical and empirical K-moments are
indistinguishable for T > 1 year. The empirical distribution from order statistics (calculated by formula III
of Table 5.5 and not considering dependence so that it is the same in both panels) is also plotted for
comparison.
Figure 6.17 Same as Figure 6.16, but for the Pareto-Burr-Feller distribution fitted for the entire range of
return period (left) assuming independence and (right) accounting for long-range dependence; note that
empirical return periods based on order statistics do not consider dependence and thus they are the same
in both panels.


Referring to the numerical results in Table 6.8, we can provide a final comparison focusing on
the question of what the design value (distribution quantile) is for return period T = 1000 years.

• If we followed the dominant approach of using the Gumbel (EV1) distribution on annual maxima,
which is equivalent to using an exponential tail for the parent distribution, then the design value
would be 152.4 mm. Note that this is lower than the record observation, which is 155.7 mm.

• If we changed the distribution from exponential to Pareto, thus being more consistent with recent
findings (see also discussion and references in Digression 2.H and in Digression 8.F), the design
value would be 175.5 mm (Table 6.8), a 15% increase.

• By assuming Pareto tail and also accounting for dependence, the design value becomes 218.3
mm, a 43% increase in comparison to the initial estimate of 152.4 mm.

• Additional changes can arise if we use a model with more parameters, such as the PBF
distribution; however, it is not clear if these changes would increase or decrease the design
value, as the direction depends on additional assumptions. As the additional parameters result
in higher uncertainty, it may be preferable to use the more parsimonious Pareto model.
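
The design values quoted above can be checked with a few lines of Python. The sketch below is only illustrative: it assumes the zero-lower-bound quantile function $x = \lambda\,(((T/D)^{\xi\zeta} - 1)/(\xi\zeta))^{1/\zeta}$ (Pareto for $\zeta = 1$), takes D as the mean spacing of the 19426 values in 206 years for the all-data fits, and takes D = 1 year for the VOT exponential fit. The function names and the choice of D are assumptions made here, not part of the original procedure.

```python
import math

# Assumption: mean spacing (in years) of the values used in the all-data fits,
# i.e. 19426 values in 206 years as quoted in the text.
D_ALL = 206 / 19426

def pbf_quantile(T, xi, lam, zeta=1.0, D=D_ALL):
    """Quantile x (mm) for return period T (years); zeta = 1 gives the Pareto case."""
    k = xi * zeta
    return lam * (((T / D) ** k - 1.0) / k) ** (1.0 / zeta)

def exponential_quantile(T, lam, psi, D=1.0):
    """Quantile of F(x) = 1 - exp(-(x/lam - psi)); D = 1 year for the VOT sample."""
    return lam * (psi + math.log(T / D))

# Parameters taken from Table 6.8 (independence cases).
x_exp = exponential_quantile(1000, lam=15.27, psi=3.07)
x_pareto = pbf_quantile(1000, xi=0.098, lam=8.30)
x_pbf = pbf_quantile(1000, xi=0.098, lam=7.07, zeta=0.928)
increase = x_pareto / x_exp - 1.0

print(f"exponential (VOT): {x_exp:.1f} mm")
print(f"Pareto:            {x_pareto:.1f} mm")
print(f"PBF:               {x_pbf:.1f} mm")
print(f"Pareto vs exponential: {100 * increase:.0f}% increase")
```

Because the tabulated parameters are rounded, the computed quantiles agree with the tabulated ones only to within a few tenths of a millimetre.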


Note that the increases due to methodological improvements within a consistent stationary
stochastic framework are much larger than those usually published in modern literature
as increases attributed to global warming (see Koutsoyiannis, 2020b). The finding here
may lead to the following suggestions for fine scale rainfall extremes:


Assume stationarity. 

Use Pareto tail. 

Take dependence into account. 

Fit based on K-moments of high order. 


Table 6.8 Comparison of model parameters and resulting quantile values for characteristic return periods
for the different models fitted.

| Distribution | Fitting assumptions¹ | Tail index, ξ | Scale parameter, λ (mm) | Location parameter, ψ | Lower-tail index, ζ | x (mm), T = 100 years | x (mm), T = 1000 years | x (mm), T = 10000 years |
|---|---|---|---|---|---|---|---|---|
| Exponential | I-VOT | 0 | 15.27 | 3.07 | 1 | 117.2 | 152.4 | 187.6 |
| Pareto | I-HT | 0.098 | 8.30 | 0 | 1 | 122.9 | 175.5 | 241.3 |
| PBF | I-A | 0.042 | 6.12 | 0 | 0.786 | 124.1 | 173.7 | 230.7 |
| PBF | I-A | 0.098 | 7.07 | 0 | 0.928 | 124.0 | 179.9 | 250.5 |
| Pareto | D-HT | 0.120 | 8.85 | 0 | 1 | 147.7 | 218.3 | 311.4 |
| PBF | D-A | 0.058 | 6.49 | 0 | 0.775 | 148.7 | 213.5 | 290.8 |
| PBF | D-A | 0.120 | 6.98 | 0 | 0.891 | 151.6 | 229.7 | 333.9 |

¹ I: independence; D: dependence; A: all data; VOT: values above threshold; HT: high return period, T ≥ 1
year.


Appendix 6-I: The binomial identity and the binomial and Bernoulli 
transforms 


The binomial identity is:

$$(x + y)^p = \sum_{i=0}^{p} \binom{p}{i} x^i y^{p-i} = \sum_{i=0}^{p} \binom{p}{i} x^{p-i} y^i \tag{6.141}$$

where p is a nonnegative integer and x and y are any numbers. The identity can be expanded for
any real (or even complex) p. Assuming |x| < |y| (to guarantee convergence), the identity takes
the form:

$$(x + y)^p = \sum_{i=0}^{\infty} \binom{p}{i} x^i y^{p-i} \tag{6.142}$$

and since for integer p and for i > p the binomial coefficient $\binom{p}{i}$ is zero, (6.141) is readily
recovered from (6.142).
In our stochastic context, it is interesting to study the case where x represents a stochastic
variable and y a number. We thus get the following characteristic cases:

• $x \to x$, $y = 1$:

$$(x + 1)^p = \sum_{i=0}^{p} \binom{p}{i} x^i \tag{6.143}$$

• $x \to -x$, $y = 1$:

$$(1 - x)^p = \sum_{i=0}^{p} \binom{p}{i} (-1)^i x^i \tag{6.144}$$

• $x \to Px$, $y = 1 - P$, typically for $0 < P < 1$:

$$(Px + 1 - P)^p = \sum_{i=0}^{p} \binom{p}{i} P^i (1 - P)^{p-i} x^i \tag{6.145}$$


If we take expected values on (6.144), and denote $a_i := E[x^i]$, $b_i := E[(1 - x)^i]$, we can write:

$$b_p = \sum_{i=0}^{p} \binom{p}{i} (-1)^i a_i \tag{6.146}$$

This latter equation defines the binomial transform, which is self-inverse (involutory), i.e.:

$$a_p = \sum_{i=0}^{p} \binom{p}{i} (-1)^i b_i \tag{6.147}$$

Using the symbol B for the binomial transform we can write:

$$b_p = (Ba)_p \Leftrightarrow a_p = (Bb)_p \tag{6.148}$$


Likewise, if we take expected values on (6.145), and denote $c_i := E[(Px + 1 - P)^i]$, we can
write:

$$c_p = \sum_{i=0}^{p} \binom{p}{i} P^i (1 - P)^{p-i} a_i \tag{6.149}$$

This latter equation defines the Bernoulli transform with parameter P. If we denote it with the
symbol $B^P$ we can write:

$$c_p = (B^P a)_p \tag{6.150}$$

The relationship between the Bernoulli and binomial transforms is found as follows:

$$c_p = (1 - P)^p \sum_{i=0}^{p} \binom{p}{i} \left(\frac{P}{1 - P}\right)^i a_i = (1 - P)^p (Ba')_p, \qquad a'_p := \left(\frac{P}{P - 1}\right)^p a_p \tag{6.151}$$

Consequently:

$$c_p = (B^P a)_p = (1 - P)^p (Ba')_p \tag{6.152}$$

which can be written as

$$c'_p = (Ba')_p, \qquad c'_p := \frac{c_p}{(1 - P)^p} \tag{6.153}$$

and by inverting the binomial transform

$$a'_p = (Bc')_p \Leftrightarrow a_p = \left(\frac{P - 1}{P}\right)^p (Bc')_p = ((B^P)^{-1} c)_p \tag{6.154}$$

where $(B^P)^{-1}$ is the inverse Bernoulli transform.
By setting P = 1/2, so that $a'_p = (-1)^p a_p$, $c'_p = 2^p c_p$, we find:

$$(B^{1/2} a)_p = 2^{-p} (Ba')_p \tag{6.155}$$

which shows that the binomial transform can be viewed as a special case of the Bernoulli
transform.
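
The two transforms are easy to verify numerically. The following sketch uses exact rational arithmetic and, purely as an illustrative assumption, the moments $a_i = 1/(i+1)$ of a variable uniformly distributed on (0, 1); the function names are hypothetical. It checks the involution property (6.147) and the route from the Bernoulli to the binomial transform in (6.151)-(6.152).

```python
from fractions import Fraction
from math import comb

def binomial_transform(a):
    """(Ba)_p = sum_i C(p,i) (-1)^i a_i, as in (6.146)."""
    return [sum(comb(p, i) * (-1) ** i * a[i] for i in range(p + 1))
            for p in range(len(a))]

def bernoulli_transform(a, P):
    """(B^P a)_p = sum_i C(p,i) P^i (1-P)^(p-i) a_i, as in (6.149)."""
    return [sum(comb(p, i) * P ** i * (1 - P) ** (p - i) * a[i]
                for i in range(p + 1))
            for p in range(len(a))]

# Illustrative input: a_i = E[x^i] = 1/(i+1) for x uniform on (0, 1).
a = [Fraction(1, i + 1) for i in range(6)]
b = binomial_transform(a)        # b_i = E[(1-x)^i]; equals a_i here by symmetry
a_back = binomial_transform(b)   # applying the transform twice recovers a

# Bernoulli transform obtained through the binomial one, (6.151)-(6.152).
P = Fraction(3, 10)
a_prime = [(P / (P - 1)) ** i * a[i] for i in range(len(a))]
c_direct = bernoulli_transform(a, P)
c_via_binomial = [(1 - P) ** p * v
                  for p, v in enumerate(binomial_transform(a_prime))]

print(a_back == a, c_direct == c_via_binomial)
```

Using `Fraction` avoids the cancellation errors that alternating binomial sums suffer in floating point when p is large.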
Extending the result in equation (6.144), we multiply both sides by any function g(x):

$$g(x)(1 - x)^p = g(x) \sum_{i=0}^{p} \binom{p}{i} (-1)^i x^i \tag{6.156}$$

and take expected values to find:

$$b_p = \sum_{i=0}^{p} \binom{p}{i} (-1)^i a_i = (Ba)_p \tag{6.157}$$

where now:

$$a_p := E[g(x)\, x^p], \qquad b_p := E[g(x)(1 - x)^p] \tag{6.158}$$


Appendix 6-II: Relationships between different moment categories 


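For mean $\mu \ne 0$, the classical central and noncentral moments can, respectively, be written as:

$$\mu_p = E[(x - \mu)^p] = (-\mu)^p E\left[\left(1 - \frac{x}{\mu}\right)^p\right], \qquad \mu'_p = E[x^p] = \mu^p E\left[\left(\frac{x}{\mu}\right)^p\right] \tag{6.159}$$

Thus, the sequences $\mu_p/(-\mu)^p$ and $\mu'_p/\mu^p$ are related to each other through the binomial
transform (equation (6.144)), i.e.:

$$\frac{\mu_p}{(-\mu)^p} = \sum_{i=0}^{p} \binom{p}{i} (-1)^i \frac{\mu'_i}{\mu^i}, \qquad \frac{\mu'_p}{\mu^p} = \sum_{i=0}^{p} \binom{p}{i} (-1)^i \frac{\mu_i}{(-\mu)^i} \tag{6.160}$$

The latter simplify to:

$$\mu_p = \sum_{i=0}^{p} \binom{p}{i} (-\mu)^{p-i} \mu'_i, \qquad \mu'_p = \sum_{i=0}^{p} \binom{p}{i} \mu^{p-i} \mu_i \tag{6.161}$$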
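
As a quick check of the conversion between noncentral and central moments, the sketch below applies it to the exponential distribution, whose noncentral moments are $i!\,\lambda^i$ and whose central moments are $!p\,\lambda^p$ (with $!p$ the subfactorial, as used earlier in this chapter). The value $\lambda = 2$ and the function names are arbitrary choices made for illustration.

```python
from math import comb, factorial

def subfactorial(p):
    """Subfactorial !p, computed with the recurrence !k = k * !(k-1) + (-1)^k."""
    value = 1
    for k in range(1, p + 1):
        value = k * value + (-1) ** k
    return value

def central_from_noncentral(noncentral, mu):
    """mu_p = sum_i C(p,i) (-mu)^(p-i) mu'_i, first relation in (6.161)."""
    return [sum(comb(p, i) * (-mu) ** (p - i) * noncentral[i]
                for i in range(p + 1))
            for p in range(len(noncentral))]

def noncentral_from_central(central, mu):
    """mu'_p = sum_i C(p,i) mu^(p-i) mu_i, second relation in (6.161)."""
    return [sum(comb(p, i) * mu ** (p - i) * central[i] for i in range(p + 1))
            for p in range(len(central))]

lam = 2  # illustrative scale parameter; the mean of the distribution equals lam
noncentral = [factorial(i) * lam ** i for i in range(7)]
central = central_from_noncentral(noncentral, mu=lam)
expected = [subfactorial(p) * lam ** p for p in range(7)]
round_trip = noncentral_from_central(central, mu=lam)

print(central == expected, round_trip == noncentral)
```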
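Similar relationships can be obtained between central and noncentral K-moments. In
particular, from (6.7) we find

$$\begin{aligned} K_{pq} &= (p - q + 1) E\left[(F(x))^{p-q} (x - \mu)^q\right] = (p - q + 1) E\left[(F(x))^{p-q} \sum_{i=0}^{q} \binom{q}{i} (-\mu)^{q-i} x^i\right] \\ &= \sum_{i=0}^{q} \binom{q}{i} (-\mu)^{q-i} (p - q + 1) E\left[(F(x))^{p-q} x^i\right] \end{aligned} \tag{6.162}$$

and finally

$$K_{pq} = \sum_{i=0}^{q} \binom{q}{i} (-\mu)^{q-i} K'_{p-q+i,i} \tag{6.163}$$

which can also be written in a binomial transform form as

$$\frac{K_{pq}}{(-\mu)^q} = \sum_{i=0}^{q} \binom{q}{i} (-1)^i \frac{K'_{p-q+i,i}}{\mu^i} \tag{6.164}$$

The inverse transform, after algebraic manipulations, becomes

$$K'_{pq} = \sum_{i=0}^{q} \binom{q}{i} \mu^{q-i} K_{p-q+i,i} \tag{6.165}$$

In particular, for q = 1:

$$K_p = K'_p - \mu, \qquad K'_p = K_p + \mu \tag{6.166}$$

and for q = 2:

$$K_{p2} = K'_{p2} - 2\mu K'_{p-1,1} + \mu^2, \qquad K'_{p2} = K_{p2} + 2\mu K_{p-1,1} + \mu^2 \tag{6.167}$$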


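To find the relationship of noncentral to tail-based K-moments, we start from their definitions
writing them in the form:

$$\frac{K'_{pq}}{p - q + 1} = E\left[x^q (F(x))^{p-q}\right], \qquad \frac{\overline{K}'_{pq}}{p - q + 1} = E\left[x^q (1 - F(x))^{p-q}\right] \tag{6.168}$$

Setting j = p − q we have:

$$\frac{K'_{q+j,q}}{j + 1} = E\left[x^q (F(x))^j\right], \qquad \frac{\overline{K}'_{q+j,q}}{j + 1} = E\left[x^q (1 - F(x))^j\right] \tag{6.169}$$

which, combined with (6.157), indicate that the left-hand sides are related through the
binomial transform:

$$\frac{\overline{K}'_{q+j,q}}{j + 1} = \sum_{i=0}^{j} \binom{j}{i} (-1)^i \frac{K'_{q+i,q}}{i + 1} \tag{6.170}$$

Hence:

$$\overline{K}'_{q+j,q} = \sum_{i=0}^{j} \frac{j + 1}{i + 1} \binom{j}{i} (-1)^i K'_{q+i,q} = \sum_{i=0}^{j} \binom{j + 1}{i + 1} (-1)^i K'_{q+i,q} \tag{6.171}$$

which, after the necessary manipulations, yields (6.23).

To find the relationship of central to hypercentral moments, we proceed as follows: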
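$$\begin{aligned} K^{+}_{pq} &= 2^{p-q} (p - q + 1) E\left[(F(x) - 1/2)^{p-q} (x - \mu)^q\right] \\ &= 2^{p-q} (p - q + 1) E\left[\sum_{i=0}^{p-q} \binom{p-q}{i} \left(-\frac{1}{2}\right)^i (F(x))^{p-q-i} (x - \mu)^q\right] \\ &= 2^{p-q} (p - q + 1) \sum_{i=0}^{p-q} \binom{p-q}{i} \left(-\frac{1}{2}\right)^i E\left[(F(x))^{p-q-i} (x - \mu)^q\right] \\ &= 2^{p-q} (p - q + 1) \sum_{i=0}^{p-q} \binom{p-q}{i} \left(-\frac{1}{2}\right)^i \frac{K_{p-i,q}}{p - q - i + 1} \end{aligned} \tag{6.172}$$

This results in

$$K^{+}_{pq} = 2^{p-q} \sum_{i=0}^{p-q} \left(-\frac{1}{2}\right)^i \binom{p - q + 1}{i} K_{p-i,q} = \sum_{i=0}^{p-q} (-1)^i\, 2^{p-q-i} \binom{p - q + 1}{i} K_{p-i,q} \tag{6.173}$$

where the inverse is

$$K_{pq} = 2^{-(p-q)} \sum_{i=0}^{p-q} \binom{p - q + 1}{i} K^{+}_{p-i,q} \tag{6.174}$$
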
Appendix 6-III: Relationships between K-moments of continuous and mixed 
distributions 


—l/* 
The tail-based K-moments K, ofa distribution function with a discontinuity 1 — P, at the origin 
are related to those of the distribution without a discontinuity by: 


p 
K, = PPK, = =-) C1) (i) (1—P,)'R; (6.175) 
1=0 
On the other hand, the noncentral K-moment of order p is: 
Pp p p i ; 
te eee Dame) Ce Cee 
i=1 i=1 ae (6.176) 


“Ye 1! Dis vi ( 


Using the identity: 


MO=MC-) (6177) 


we find: 
m= YevR reo Me? ; )ei=Se 1)! ( aXe ni (P71) pi = 
l=1 


pl 


“Ye 1)! C )K iene ual (6.178) 


i=0 


“Ye yt! ( aXe ial 


and finally: 
p 
Ik Pp I _ 
Kit = (7) Kip — Pye (6.179) 
=t 
Regarding the approximation of equation (6.38), we define p, so that it correspond to p’ = 2. 
Thus, K,, = Kj. Approximating K,," for p < p, with a power function as in the upper case of (6.38), 


we determine b as the logarithmic slope: 
ue In(K3/P,K1) = In(K3/Kj) — InP, = cln2—InP, = In(2°/P,;) (6.180) 
In De In De In pe In pe 


where c := In(K3/K,)/In 2. On the other hand, if we assume that p’ = P,p,thenat p’ = 2,2 = P,p, 
Or p,; = 2/P,. Performing the algebraic operations we find (6.38). 


Appendix 6-IV: Proof of estimator unbiasedness 


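The density function $f_{(i:n)}(x)$ of the ith order statistic $x_{(i:n)}$ is (Papoulis 1990):

$$f_{(i:n)}(x) = (n - i + 1) \binom{n}{i - 1} (F(x))^{i-1} (1 - F(x))^{n-i} f(x) \tag{6.181}$$

and therefore:

$$\begin{aligned} E\left[x^q_{(i:n)}\right] &= \int_{-\infty}^{\infty} x^q f_{(i:n)}(x)\, dx \\ &= (n - i + 1) \binom{n}{i - 1} \int_{-\infty}^{\infty} x^q (F(x))^{i-1} (1 - F(x))^{n-i} f(x)\, dx \\ &= (n - i + 1) \binom{n}{i - 1} \int_0^1 (x(F))^q F^{i-1} (1 - F)^{n-i}\, dF \end{aligned} \tag{6.182}$$

Further, the above expected value is related to the K-moments as follows:

$$\begin{aligned} E\left[x^q_{(i:n)}\right] &= (n - i + 1) \binom{n}{i - 1} \int_0^1 (x(F))^q F^{i-1} \sum_{j=0}^{n-i} \binom{n - i}{j} (-1)^j F^j\, dF \\ &= (n - i + 1) \binom{n}{i - 1} \sum_{j=0}^{n-i} \binom{n - i}{j} (-1)^j \int_0^1 (x(F))^q F^{i+j-1+q-q}\, dF \\ &= i \binom{n}{i} \sum_{j=0}^{n-i} \binom{n - i}{j} (-1)^j \frac{K'_{i+j+q-1,q}}{i + j} \end{aligned} \tag{6.183}$$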


Now, we define the noncentral K-moment estimator:

$$\hat{K}'_{pq} = \sum_{i=1}^{n} b_{inpq}\, x^q_{(i:n)} \tag{6.184}$$

and we seek coefficients $b_{inpq}$ which make it unbiased. We recall that the ordering of the sample
is meant in terms of x and not $x^q$ (i.e., $x^q_{(i:n)} := (x_{(i:n)})^q$). This makes equations (6.182) and (6.184)
consistent with each other.
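The expectation of the estimator (6.184) is:

$$\begin{aligned} E\left[\hat{K}'_{pq}\right] &= \sum_{i=1}^{n} b_{inpq}\, E\left[x^q_{(i:n)}\right] \\ &= \sum_{i=1}^{n} b_{inpq} (n - i + 1) \binom{n}{i - 1} \int_0^1 \left((x(F))^q F^{p-q}\right) F^{i-1-p+q} (1 - F)^{n-i}\, dF \\ &= \int_0^1 \left((x(F))^q F^{p-q}\right) \sum_{i=1}^{n} b_{inpq} (n - i + 1) \binom{n}{i - 1} F^{i-1-p+q} (1 - F)^{n-i}\, dF \end{aligned} \tag{6.185}$$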


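If we choose:

$$(n - i + 1) \binom{n}{i - 1} b_{inpq} = \begin{cases} 0, & i \le p - q \\ (p - q + 1) \dbinom{n - p + q - 1}{i - p + q - 1}, & i > p - q \end{cases} \tag{6.186}$$

then the sum in (6.185) is drastically simplified, i.e.,

$$\begin{aligned} \sum_{i=1}^{n} b_{inpq} (n - i + 1) \binom{n}{i - 1} F^{i-1-p+q} (1 - F)^{n-i} &= (p - q + 1) \sum_{i=p-q+1}^{n} \binom{n - p + q - 1}{i - p + q - 1} F^{i-p+q-1} (1 - F)^{n-i} \\ &= (p - q + 1) \sum_{j=0}^{n-p+q-1} \binom{n - p + q - 1}{j} F^j (1 - F)^{n-p+q-1-j} \\ &= (p - q + 1) (F + (1 - F))^{n-p+q-1} = (p - q + 1) \end{aligned} \tag{6.187}$$

and, consequently:

$$E\left[\hat{K}'_{pq}\right] = (p - q + 1) \int_0^1 (x(F))^q F^{p-q}\, dF = K'_{pq} \tag{6.188}$$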


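In other words, to have an unbiased estimator of a noncentral K-moment it suffices to choose, for i > p − q:

$$b_{inpq} = \frac{p - q + 1}{n - i + 1} \frac{\dbinom{n - p + q - 1}{i - p + q - 1}}{\dbinom{n}{i - 1}} = \frac{p - q + 1}{n} \frac{(n - p + q - 1)!\,(i - 1)!}{(i - p + q - 1)!\,(n - 1)!} \tag{6.189}$$

and this can be generalized for non-integer p > 0, i.e.,

$$b_{inpq} = \begin{cases} 0, & i < p - q + 1 \\ \dfrac{p - q + 1}{n} \dfrac{\Gamma(n - p + q)}{\Gamma(n)} \dfrac{\Gamma(i)}{\Gamma(i - p + q)}, & i \ge p - q + 1 \end{cases} \tag{6.190}$$

where we can notice that only the last of the three multiplicative terms depends on i.
If we denote $b_{inp}$ (without the last index q) the coefficient for q = 1, i.e.:

$$b_{inp} = \begin{cases} 0, & i < p \\ \dfrac{p}{n} \dfrac{\Gamma(n - p + 1)}{\Gamma(n)} \dfrac{\Gamma(i)}{\Gamma(i - p + 1)}, & i \ge p \end{cases} \tag{6.191}$$

then it can be readily verified that

$$b_{inpq} = b_{i,n,p-q+1} \tag{6.192}$$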
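
The coefficients for q = 1 translate directly into a short estimator. The sketch below is a minimal illustration (the function names and the small samples are hypothetical); it evaluates the gamma-function form through `math.lgamma`, which keeps the ratios finite even for large n. For p = 1 the estimator reduces to the sample mean and for p = n to the sample maximum, as the weights require.

```python
from math import exp, lgamma, log

def k_moment_weights(n, p):
    """Weights b_inp for i = 1..n: zero for i < p, gamma-function form otherwise."""
    weights = []
    for i in range(1, n + 1):
        if i < p:
            weights.append(0.0)
        else:
            log_b = (log(p) - log(n)
                     + lgamma(n - p + 1) - lgamma(n)
                     + lgamma(i) - lgamma(i - p + 1))
            weights.append(exp(log_b))
    return weights

def k_moment_estimate(sample, p):
    """Noncentral K-moment estimate of order p (q = 1) from the ordered sample."""
    ordered = sorted(sample)
    weights = k_moment_weights(len(ordered), p)
    return sum(b * x for b, x in zip(weights, ordered))

sample = [3.0, 1.0, 4.0, 1.5, 9.0]
print(k_moment_estimate(sample, 1))            # order 1: the sample mean
print(k_moment_estimate(sample, len(sample)))  # order n: the sample maximum
print(k_moment_estimate([1.0, 2.0, 3.0], 2))   # mean of the maxima of all pairs
```

For integer p the weights equal $\binom{i-1}{p-1}/\binom{n}{p}$ and therefore sum to one.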
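On the other hand, combining (6.184) and (6.189) and taking expected values we find:

$$K'_{pq} = \sum_{i=p-q+1}^{n} \frac{p - q + 1}{n} \frac{(n - p + q - 1)!}{(n - 1)!} \frac{(i - 1)!}{(i - p + q - 1)!} E\left[x^q_{(i:n)}\right] \tag{6.193}$$

Multiplying and dividing the right-hand side by (p − q)! we get:

$$K'_{pq} = \sum_{i=p-q+1}^{n} \frac{(p - q + 1)!\,(n - p + q - 1)!}{n!} \frac{(i - 1)!}{(p - q)!\,(i - p + q - 1)!} E\left[x^q_{(i:n)}\right] \tag{6.194}$$

which can be written as

$$\binom{n}{p - q + 1} K'_{pq} = \sum_{i=p-q+1}^{n} \binom{i - 1}{p - q} E\left[x^q_{(i:n)}\right] \tag{6.195}$$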


Appendix 6-V: Properties of the simplified K-moment estimators 


For the case examined, the necessary unbiasedness condition of equation (6.69) takes the form: 


n 


*». (F(xem)) =4 (6.196) 


i=1 
As already stated in section 6.10, the estimator (6.81) precisely fulfils this condition for p < 4, i.e. 
the difference: 
n . p-1 
a(n,p) ==)" ees) (6.197) 
nha 2vn? —1 , 


is precisely zero. For higher p it is close, but not precisely equal, to zero. Specifically, it is smaller
than 0.05%, for moment order as high as p = n/10. Beyond n/10 and up to n/2, it reaches 1%. It
is not difficult to evaluate $\Theta(n, p)$ from equation (6.197) and then divide the K-moment estimate
by $\Theta(n, p) + 1$ to counter the deviation. Furthermore, a very accurate numerical approximation of
$\Theta(n, p)$ is:
0, ps4 
O(np)=4 1 é — 3.5\? n (6.198) 





Here we stress that (6.197) represents a necessary but not sufficient condition for unbiasedness.
For that reason, the estimator (6.81) should not be applied for p > n/2 as, even after correcting
by dividing by $\Theta(n, p) + 1$, the bias is still present.

In contrast, the approximate estimator (6.83), which does not take into account the smallest
sample values (i.e., the $x_{(i:n)}$ values for i < p) can be used for any p. In this case the bias $\Theta(n, p)$
takes the form:
Figure 6.18 Deviation $-\Theta(n, p)$ in fulfilling the necessary unbiasedness condition of the estimator
(6.83), as a function of sample size n and K-moment order p. Dash-dot lines correspond to the
specified values of p as fractions of n.


This factor is estimated either by direct application of (6.199), or by its theoretical evaluation
through the generalized Riemann zeta function $\zeta$, i.e.:


p(f(1 — p,p — a(n)) —F(1 —p,n+ 1-a(n))) 


oe) n(n — b(n))?-1 


-1 (6.200) 


where a(n) and b(n) are given by (6.82), or even by the following numerical approximation, 
which is close to accurate: 
0, p=1 
O(n, p) = a Ne -( 1 =) (6.201) 

oS 1+—-—-—J]JeP™ >1 

~24\n TOR ge UR 
The bias for the estimator (6.83) is depicted in Figure 6.18. It can be seen there that the
deviation is negligible (< 0.05%), for moment order up to p = n/10, very small (< 1%) up to n/2,
small (< 5%) up to p = n-4, and increases rapidly thereafter. It is not difficult to evaluate $\Theta(n, p)$
from equation (6.197) and then divide the K-moment estimate by $\Theta(n, p) + 1$ to remove bias.


Appendix 6-VI: Derivation of equations for the effect of autocorrelation 


It is convenient to determine first the central K-moment K$ and then the noncentral one, which 
will be Ki? = K$ + w. For K$ we may assume that the stochastic variables have been transformed 
to normal distribution with zero mean and unit variance. As our derivations are approximate, we 
neglect the effect of that transformation on the autocorrelation. Assuming that the correlation
coefficient of the variables $x_i, x_j$ is $r_{ij}$ and using known results for normal variables (Nadarajah
and Kotz, 2008; see also Appendix 5-II), the probability density of $y_{ij} = \max(x_i, x_j)$ will be:

$$f_y(y) = 2 f(y)\, F\left(\sqrt{\frac{1 - r_{ij}}{1 + r_{ij}}}\; y\right) \tag{6.202}$$

The expectation of y is then easily evaluated to:

$$K_2^{ij} = E[\max(x_i, x_j)] = \int_{-\infty}^{\infty} y f_y(y)\, dy = \sqrt{\frac{1 - r_{ij}}{\pi}} \tag{6.203}$$
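
The closed-form expectation of the maximum of two correlated standard normal variables, $\sqrt{(1 - r)/\pi}$, can be checked by simulation. The following sketch is illustrative only: the sample size, the seed and the function names are arbitrary assumptions, and the correlated pair is generated as $x_2 = r x_1 + \sqrt{1 - r^2}\, z$ with independent standard normal $x_1$ and $z$.

```python
import math
import random

def expected_max_theory(r):
    """E[max(x_i, x_j)] for standard normal variables with correlation r."""
    return math.sqrt((1.0 - r) / math.pi)

def expected_max_monte_carlo(r, n_pairs=200_000, seed=1):
    """Monte Carlo estimate of the same expectation (seed fixed for repeatability)."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - r * r)
    total = 0.0
    for _ in range(n_pairs):
        x1 = rng.gauss(0.0, 1.0)
        x2 = r * x1 + s * rng.gauss(0.0, 1.0)  # correlation between x1 and x2 is r
        total += max(x1, x2)
    return total / n_pairs

r = 0.5
theory = expected_max_theory(r)
simulated = expected_max_monte_carlo(r)
print(f"theory: {theory:.4f}, simulation: {simulated:.4f}")
```

For r = 0 the formula gives $1/\sqrt{\pi}$, the value for independent variables, and it decreases to zero as r approaches 1.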


We assume that $x_i, x_j$ are two terms amongst those in the sequence $x_1, x_2, \ldots, x_n$ and that the
process has autocorrelation function $r_{ij} = r(i - j)$. For i > j, there are (n − 1) ways to allocate i
and j so that $\tau := i - j = 1$, (n − 2) ways to allocate them so that $\tau = 2$, etc. Thus, the
average of all $K_2^{ij}$ will be:


n-1 
—r(t) 2 
Ks >) a no 0-9 ICEStP mG (n—1) (6.204) 


As r(t) < 1, this can be approximated as: 
n-1 og 
ey ce ee eee eles _ 
K3 Taaaoe r(t)/2) (n—-T) as ercenny ae. (n—t) (6.205) 


For an independent process $r(\tau) = 0$ and $K_2 = 1/\sqrt{\pi}$. Thus, if we define the adjustment coefficient
as:
04(n; r(t)) = Me MG 4 (6.206) 
n;7r(T)) = im : 
this will take the form of equation (6.92).
For a Markov process, in which $r(\tau) = r^{\tau}$, equation (6.92) yields:
2r 1i-r-r"/n 


@™(n,2) = ———_.5 ——__—_ 6.207 
(n, 2) Aaa? na ( ) 

which results in (6.93). For an HK process, where $r(\tau) \approx H(2H - 1)\tau^{2H-2}$, (6.92) yields:

H(1 — 2H) = uy 
HK ~ do EY (7-28) (2-2H) 
OP aay (Henne) (6.208) 
A rough approximation of the generalized harmonic number qo is (Lampret, 2015): 
ni-t@_— J 
UO St tagger (6.209) 
1+4+I1nn, a=0 


With this approximation, after the algebraic operations and approximations we find (6.94). 
From section 6.14, equation (6.113) we have: 


1 
Fy(Ky 
where for the normal distribution $\Lambda_1 = 2$ and $\Lambda_\infty = e^{\gamma} = 1.781$, whilst $\gamma$ is the Euler constant and
the meaning of the Λ-coefficients is explained in section 6.14. For our approximation we will
initially neglect the difference $\Lambda_1 - \Lambda_\infty$, thus introducing some error for small p, which we will
correct later. In this case:





= 1 = 1 
Pu(Kp) * 7 Pu(Ky') * 7 (6.211) 


where p’ is such that: 
Ky = Kf = (1+ O)Kp (6.212) 


where for convenience we have simplified the notation $\Theta^{\mathrm{HK}}(n, H)$ to $\Theta$. From (6.212) and (6.211),
solving for p′, we find


1 = —-1/ 1 
— x AoFy| (1+0)F (—) (6.213) 
D N ( N \Anp 

Now we use the approximation of the normal distribution function derived in Appendix 5-II 


(equations (5.44) and (5.45) for the distribution function and the quantile function, respectively) 
and find: 


a est @| {41 (SP) +4 1 2(1+6)1 (=?) =C 6.214 
eg Ol n(-S ) n()})=c@) 214) 
Calculating the log-log derivative of C(p) we find 
s 0 
COS) de (6.215) 
V4In(Awp/2) +1 


For p > 0, C*(p) > —(1+ 0)”, which does not depend on A,,. This allows simplifying the 
approximation (6.214) as: 


p’ = Ap(@+9)") 41-4 (6.216) 




for some constant A, where the term 1 − A at the end was added so as to give p′ = 1 for p = 1, thus
recovering from the error introduced by neglecting the difference $\Lambda_1 - \Lambda_\infty$. By numerical
investigation it was found that the constant $A = 1 - 2\Theta$ makes the approximation satisfactory,
thus resulting in equation (6.96).


Appendix 6-VII: Derivation of limiting Λ factors


With reference to section 2.19 on the relationship of parent and extreme value distribution,
combining equations (2.117) and (2.121), for sufficiently large threshold u we find that for a
distribution function that belongs to the domain of attraction of the Extreme Value Type I
distribution, the following approximation holds for $x \ge u$:

$$F(x) = F(u) + F(x|x > u)(1 - F(u)) = F(u) + \left(1 - \exp\left(-\frac{x - u}{\lambda}\right)\right)(1 - F(u)) \tag{6.217}$$

Inverting F we find that

$$x(F) = u - \lambda \ln\left(\frac{1 - F}{1 - F_u}\right), \qquad F \ge F_u := F(u) \tag{6.218}$$

while for $F < F_u$ the quantile function is unknown, say $x_U(F)$. From equation (6.19) we find

$$K'_p = p \int_0^1 x(F) F^{p-1} dF = p \int_0^{F_u} x_U(F) F^{p-1} dF + p \int_{F_u}^1 x(F) F^{p-1} dF = A + B + C \tag{6.219}$$

where:

$$A := p \int_0^{F_u} x(F) F^{p-1} dF, \qquad B := -p \int_0^{F_u} x_U(F) F^{p-1} dF, \qquad C := p \int_0^1 x(F) F^{p-1} dF \tag{6.220}$$


As x(F) is a non-decreasing function with lower limit $x(0) = u + \lambda \ln(1 - F_u)$ and upper limit
$x(F_u) = u$, we have for the term A:

$$p \int_0^{F_u} (u + \lambda \ln(1 - F_u)) F^{p-1} dF \le A \le p \int_0^{F_u} u F^{p-1} dF \tag{6.221}$$

or

$$(u + \lambda \ln(1 - F_u)) F_u^p \le A \le u F_u^p \tag{6.222}$$

The unknown $x_U(F)$ should also be a non-decreasing function with lower limit, say, c (assumed
finite as happens in hydrometeorological variables, e.g. c = 0) and upper limit $x_U(F_u) = x(F_u) = u$.
Thus, we have for the term B:

$$-p \int_0^{F_u} u F^{p-1} dF \le B \le -p \int_0^{F_u} c F^{p-1} dF \tag{6.223}$$

or

$$-u F_u^p \le B \le -c F_u^p \tag{6.224}$$

The term C evaluates to

$$C = p \int_0^1 \left(u - \lambda \ln\left(\frac{1 - F}{1 - F_u}\right)\right) F^{p-1} dF = u + \lambda (H_p + \ln(1 - F_u)) \tag{6.225}$$

Now, combining all above we find:

$$C + (\lambda \ln(1 - F_u)) F_u^p \le K'_p \le C + (u - c) F_u^p \tag{6.226}$$

or

$$1 + \frac{\lambda \ln(1 - F_u)\, F_u^p}{C} \le \frac{K'_p}{C} \le 1 + \frac{(u - c)\, F_u^p}{C} \tag{6.227}$$

As $p \to \infty$, clearly $C \to \infty$, while both the lower and upper limit in the above inequality tend to 1
(notice that $F_u < 1$ and thus $F_u^p \to 0$). Thus, as $p \to \infty$, $K'_p/C \to 1$ and by virtue of (6.225), the
following approximation holds:

$$K'_p \approx u + \lambda (H_p + \ln(1 - F_u)) \tag{6.228}$$

From (6.217) we find:

$$F(K'_p) = 1 - \exp(-H_p) \tag{6.229}$$

and from (6.109) we obtain:

$$\Lambda_p = \frac{\exp(H_p)}{p} \tag{6.230}$$

The last relationship holds true precisely for the exponential distribution for any p and at the limit
as $p \to \infty$ for any distribution belonging to the domain of attraction of the Extreme Value Type I
distribution. This limit is evaluated to:

$$\Lambda_\infty = e^{\gamma} \tag{6.231}$$
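
A brief numerical illustration of this limit, assuming the relationship $\Lambda_p = \exp(H_p)/p$ that holds exactly for the exponential distribution (the function names below are hypothetical): the coefficient starts at e for p = 1 and decreases towards $e^{\gamma} \approx 1.781$ as p grows.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def harmonic(p):
    """pth harmonic number H_p for a positive integer p."""
    return sum(1.0 / i for i in range(1, p + 1))

def lambda_exponential(p):
    """Lambda_p = exp(H_p) / p, exact for the exponential distribution."""
    return math.exp(harmonic(p)) / p

for p in (1, 10, 100, 1000, 100000):
    print(p, round(lambda_exponential(p), 5))
print("limit:", round(math.exp(EULER_GAMMA), 5))
```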


Appendix 6-VIII: Explanations for the approximation of Λ-coefficients


Equation (6.112), i.e.:

$$\Lambda_p \approx \Lambda_\infty + (\Lambda_1 - \Lambda_\infty) \frac{1}{p} \tag{6.232}$$

provides a first approximation of $\Lambda_p$ for any distribution but can be improved. While it captures
the initial and final values, $\Lambda_1$ and $\Lambda_\infty$, it does not reflect the rate at which $\Lambda_p$ tends to $\Lambda_\infty$, which
differs in different distributions. We quantify this rate through the difference

$$\Delta\Lambda_p := \Lambda_p - \Lambda_\infty \tag{6.233}$$

If we approximate $\Lambda_p$ with equation (6.114), the same difference will be given by:


dy p20 
AAR =e Rin Bp’ (1+ +) peel (6.234) 
aes (p+1)8 -1/]’ 1p ps0 , 
where the superscript ‘A’ stands for “approximation”. It can be easily demonstrated that for any
$\beta \in \mathbb{R}$:

$$\Delta\Lambda_p^{\mathrm{A}} \to 0 \quad \text{as } p \to \infty$$

The rate at which $\Delta\Lambda_p^{\mathrm{A}}$ tends to zero is described by the following asymptotic properties (given
here without proof):





= =0 
Inp’ oe 
BB 
~ 5IBI" 0< |p| <1 
A 
AA; a A-p'B (6.235) 
PNB = 1 
Pp 
[B| >1 
p’ 
Furthermore, we easily find for p = 1 that: 
A-Bl (1 + : ) =0 
ie In2/’ pr 
aad = {4— Bln +B), [B|=1 (6.236) 





A-Bln (* (1 + a :)) otherwise 


Now, for each particular distribution function we should find first the parameter $\beta$ from the
asymptotic properties of the function and then match $\Delta\Lambda_p^{\mathrm{A}}$ and $\Delta\Lambda_p$ for p = 1 and $p \to \infty$, utilizing
the above two equations. A systematic study of several distributions, using both theoretical and
numerical analyses, gave the results listed in Table 6.6.


Appendix 6-IX: The invariance of Λ-coefficients under linear
transformations


Considering the linear transformation $w = bx + c$, the distribution function of w is $F_w(w) =
F((w - c)/b)$ and the density is $f_w(w) = f((w - c)/b)/b$. Thus,

$$\begin{aligned} K'_{wpq} &= (p - q + 1) E\left[(F_w(w))^{p-q} w^q\right] = (p - q + 1) E\left[(F((w - c)/b))^{p-q} w^q\right] \\ &= (p - q + 1) E\left[(F(x))^{p-q} (bx + c)^q\right] \\ &= (p - q + 1)\, b^q E\left[(F(x))^{p-q} (x + c/b)^q\right] \end{aligned} \tag{6.237}$$

This means that $K'_{wpq}$ depends on b and on the ratio c/b, and its determination requires the
expansion of $(x + c/b)^q$ through the binomial coefficients $\binom{q}{i}$, i = 0, …, q. Thus, the result will be
an involved expression that will contain a weighted sum of K-moments of x, $K'_{pi}$ for i = 0, …, q.

Note that the case q = 1 is an exception as

$$K'_{wp} = pb\, E\left[(F(x))^{p-1} (x + c/b)\right] = pb\, E\left[(F(x))^{p-1} x\right] + pc\, E\left[(F(x))^{p-1}\right] = b K'_p + c \tag{6.238}$$

The situation is much simpler for the central moments, i.e.:

$$\begin{aligned} K_{wpq} &= (p - q + 1) E\left[(F_w(w))^{p-q} (w - \mu_w)^q\right] \\ &= (p - q + 1) E\left[(F((w - c)/b))^{p-q} (w - \mu_w)^q\right] \\ &= (p - q + 1) E\left[(F(x))^{p-q} ((bx + c) - (b\mu + c))^q\right] \\ &= (p - q + 1)\, b^q E\left[(F(x))^{p-q} (x - \mu)^q\right] \end{aligned} \tag{6.239}$$

and finally,

$$K_{wpq} = b^q K_{pq} \tag{6.240}$$

Furthermore,

$$F_w\left((K_{wpq})^{1/q} + \mu_w\right) = F\left(\frac{b (K_{pq})^{1/q} + b\mu + c - c}{b}\right) = F\left((K_{pq})^{1/q} + \mu\right) \tag{6.241}$$

Hence,

$$\Lambda_{wpq} = \Lambda_{pq} \tag{6.242}$$

which proves equation (6.131).
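
The invariance can be illustrated for q = 1 with the exponential distribution, for which the noncentral K-moment is $\lambda H_p$ in closed form. The sketch below is a hedged illustration: the parameter values are arbitrary, the function names are hypothetical, and the Λ-coefficient is taken, as an assumption consistent with $T(K_p)/D = p\Lambda_p$, to be $1/(p\,(1 - F(K'_p)))$.

```python
import math

def harmonic(p):
    """pth harmonic number H_p for a positive integer p."""
    return sum(1.0 / i for i in range(1, p + 1))

def lambda_coefficient(exceedance_probability, p):
    """Lambda_p = 1 / (p * (1 - F(K'_p))); assumed definition, see lead-in."""
    return 1.0 / (p * exceedance_probability)

lam, b, c, p = 7.2, 2.5, 4.0, 20   # illustrative values only

K_x = lam * harmonic(p)            # noncentral K-moment of x (exponential, scale lam)
K_w = b * K_x + c                  # noncentral K-moment of w = b x + c for q = 1

exceed_x = math.exp(-K_x / lam)                # 1 - F(K_x)
exceed_w = math.exp(-(K_w - c) / (b * lam))    # 1 - F_w(K_w), with F_w(w) = F((w - c)/b)

lambda_x = lambda_coefficient(exceed_x, p)
lambda_w = lambda_coefficient(exceed_w, p)
print(lambda_x, lambda_w)
```

Both values coincide with $\exp(H_p)/p$, so the shift c and the scaling b leave the coefficient unchanged.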


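Appendix 6-X: Approximation of K-moments of a normal variable
For a probability density that is an even function we have $f(x) = f(-x)$, $\overline{F}(x) := 1 - F(x) =
F(-x)$, $x(F) = -x(1 - F)$. Hence, using (6.19), we find:

$$\begin{aligned} K_{pq} = K'_{pq} &= (p - q + 1) \left[\int_0^{1/2} (x(F))^q F^{p-q} dF + \int_{1/2}^1 (x(F))^q F^{p-q} dF\right] \\ &= (p - q + 1) \left[\int_0^{1/2} (x(F))^q F^{p-q} dF + \int_0^{1/2} (x(1 - F))^q (1 - F)^{p-q} dF\right] \\ &= (p - q + 1) \left[\int_0^{1/2} (x(F))^q F^{p-q} dF + \int_0^{1/2} (-1)^q (x(F))^q (1 - F)^{p-q} dF\right] \\ &= (p - q + 1) \int_0^{1/2} (x(F))^q \left(F^{p-q} + (-1)^q (1 - F)^{p-q}\right) dF \end{aligned} \tag{6.243}$$

For q = 1 this becomes

$$K_p = K'_p = p \int_0^{1/2} x(F) \left(F^{p-1} - (1 - F)^{p-1}\right) dF \tag{6.244}$$

Let

$$B_p := p \int_0^{1/2} x(F) F^{p-1} dF \tag{6.245}$$

Then

$$\begin{aligned} p \int_0^{1/2} x(F) (1 - F)^{p-1} dF &= p \sum_{i=0}^{p-1} \binom{p - 1}{i} (-1)^i \int_0^{1/2} x(F) F^i dF \\ &= \sum_{i=0}^{p-1} \binom{p}{i + 1} (-1)^i B_{i+1} = -\sum_{i=1}^{p} \binom{p}{i} (-1)^i B_i \end{aligned} \tag{6.246}$$

and hence

$$K_p = B_p + \sum_{i=1}^{p} \binom{p}{i} (-1)^i B_i \tag{6.247}$$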

For the normal distribution B, can be approximated using (5.45), i.e., 


1/2 
3 3VTe?/* erfc(,/p/2 
By © D | (V1 = 4In(@2F) — 1) F?-* dF | = so ee) (6.248) 


0 
To find the A-coefficients we use the approximation (5.44) noting that Ky, > 0 for any p > 0: 
Z 


Ay SS 
2% Zoey 6.249 
pexw(—3 Ky (4 + 3%) ( ) 


This proves equation (6.132). 


Chapter 7. Stochastic simulation of hydroclimatic processes 


7.1 Desiderata of a simulation scheme 


In several instances in the previous chapters, we had to deal with problems that do not 
admit an analytical solution. A most promising alternative for such problems is the 
stochastic (or Monte Carlo) simulation, which has been introduced in section 2.6. If the 
processes we had to deal with could effectively be modelled as white noise, then the 
simple random number generators presented in section 2.6 would be enough for our 
simulations. However, hydroclimatic processes are characterized by several behaviours 
which we need to respect and reproduce in our simulations, both qualitatively and 
quantitatively. While here we avoid providing all details about these behaviours and
reviewing the variety of methods devised to deal with them, we provide a rather simple
generic scheme that can be used in most problems related to extremes of hydroclimatic
processes. In Digression 7.A we also discuss non-conventional types of stochastic
simulation, by conversion of deterministic models into stochastic ones.

Before we discuss simulation schemes per se it is useful to summarize the 
characteristic behaviours. 

Periodicity. When the time scale of interest is finer than annual, hydroclimatic 
processes exhibit seasonality, related to the annual motion of Earth around the Sun. In 
addition, when the time scale of interest is finer than daily, some of those processes may 
exhibit regular diurnal variation, related to the daily rotation of Earth. The most 
appropriate technique to deal with these regular variations is to build a so-called 
cyclostationary model, with single or double periodicity, depending on intensity of the 
periodic variation and its effect for the very problem of interest. In a cyclostationary 
model the parameters of the nth order distribution function vary according to periodic 
(apparently, deterministic) functions of time. 

Here we will not discuss the rather sophisticated methods of this category, but we will 
resort to simpler methods in which only the first-order (marginal) distribution function 
of the process is dealt with. We list the following techniques of this category of 
approximate methods, from the most to the least complex. 


• Nonlinear transformation of the process by “season” and/or “hour”, where
“season” is a part of the year (e.g. one or more months) in which the seasonal
variation is no longer substantial, and likewise for “hour” (which may mean one or
more hours); the standard transformation of this type is one that makes
the distribution standard normal (normalization).

• Linear transformation, or else standardization, of the process, usually expressed as
$y_\tau = (x_\tau - \mu_\tau)/\sigma_\tau$, where $\mu_\tau$ and $\sigma_\tau$ are periodic functions of the time $\tau$; this is
followed by modelling the transformed process and recovering $x_\tau$ by applying the inverse
transformation.

• Proportional adjustment (or linear mapping), which is similar to the linear
transformation except that there is no subtraction, i.e., $y_\tau = x_\tau/a_\tau$, where $a_\tau$ is a
periodic function of the time $\tau$; the advantages of this technique are its parsimony
and avoidance of negative values in the case that the process is nonnegative (e.g.
rainfall).

• Null case (or do-nothing) is an option when the periodicity entails a negligible
effect on the problem we study; an example where the null case is applicable is
discussed in Digression 6.B.
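
The linear transformation and the proportional adjustment listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the monthly period and the synthetic gamma-distributed series are hypothetical, and the periodic functions $\mu_\tau$, $\sigma_\tau$ and $a_\tau$ are simply taken as the sample statistics of each position in the cycle.

```python
import numpy as np

def periodic_stats(x, period):
    """Sample mean and standard deviation for each position in the cycle."""
    x = np.asarray(x, dtype=float)
    n_cycles = len(x) // period
    folded = x[:n_cycles * period].reshape(n_cycles, period)
    return folded.mean(axis=0), folded.std(axis=0, ddof=1)

def standardize(x, mu, sigma):
    """Linear transformation y = (x - mu) / sigma with periodic mu, sigma."""
    idx = np.arange(len(x)) % len(mu)
    return (np.asarray(x, dtype=float) - mu[idx]) / sigma[idx]

def destandardize(y, mu, sigma):
    """Inverse of the linear transformation, recovering x from y."""
    idx = np.arange(len(y)) % len(mu)
    return np.asarray(y, dtype=float) * sigma[idx] + mu[idx]

def proportional_adjust(x, a):
    """Proportional adjustment y = x / a with periodic a (no subtraction)."""
    idx = np.arange(len(x)) % len(a)
    return np.asarray(x, dtype=float) / a[idx]

# Hypothetical monthly series: 40 years of nonnegative values with a seasonal mean
rng = np.random.default_rng(1)
period = 12
seasonal_mean = 50 + 30 * np.sin(2 * np.pi * np.arange(period) / period)
x = rng.gamma(shape=2.0, scale=np.tile(seasonal_mean, 40) / 2.0)

mu, sigma = periodic_stats(x, period)
y = standardize(x, mu, sigma)          # seasonal mean 0 and standard deviation 1
x_back = destandardize(y, mu, sigma)   # recovers the original series
z = proportional_adjust(x, mu)         # stays nonnegative, seasonal mean 1
```

The proportional adjustment needs one periodic function instead of two and cannot produce negative values from a nonnegative series, which is the parsimony argument made in the list above.
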


Dependence (and particularly long-range dependence). The omnipresence of 
dependence in natural processes is a sufficient reason to replace the classical IID statistics 
with stochastic processes. Short-range dependence has been the basis of using ARMA- 
type models, but these prove inadequate for many natural processes. Therefore, our 
simulation scheme should be able to reproduce long-range dependence. Dependence is 
typically handled through the second-order characteristics of a stochastic process, while 
long-range dependence is identified through the asymptotic LLDs of the second order 
characteristics (section 3.8). Among them, the climacogram and climacospectrum are 
most useful for the model identification and fitting phases, while for the simulation phase 
the autocovariance also becomes very useful. Preservation of any one of the second-order
characteristics results in preservation of all others.

Intermittence. At fine time scales, hydroclimatic processes exhibit intermittent 
behaviour. This is most clear in the rainfall process, where it is quantified by the 
probability dry (probability of dry state). Similar is the situation with the streamflow in 
ephemeral streams. However, intermittence may appear in a less visible manner in the 
streamflow of large rivers with permanent flow, where the state switches between 
baseflow and flood. The baseflow is characterized by its own variability, and therefore a 
characterization by a single parameter, such as probability of the baseflow state, would 
be inefficient. A more general characterization of intermittence can be made in terms of 
high-order moments, starting from the skewness. 

Skewness and high-order moments. At fine and intermediate time scales, most 
hydroclimatic processes have positively skewed distribution functions. The skewness is 
mainly caused by the fact that hydroclimatic variables are non-negative and sometimes 
intermittent. This is not so common in other scientific fields whose processes can safely 
be regarded as Gaussian. Thus, the preservation of skewness becomes important for 
hydroclimatic processes, while in combination with intermittence, proper modelling 
should include preservation of moments of order higher than 3. Unlike the second-order 
characteristics, where simulation schemes are able to preserve joint and marginal 
moments, for orders of > 3 only marginal moments can be dealt with in an explicit 
manner. 

Time irreversibility. In the streamflow process, time irreversibility (the asymmetry 
in time, manifested e.g. with rapid increases followed by gradual decreases) is evident up 
to time scales of several days, while in atmospheric processes irreversibility appears only 
at very fine scales (Koutsoyiannis, 2019b). Irreversibility can be quantified by the 
skewness of the time differenced process $\tilde{x}_\tau := x_\tau - x_{\tau-1}$, which in turn has to be
preserved in simulations. This preservation is feasible only if $x_\tau$ per se has a skewed
distribution, while processes with symmetric distribution are also time symmetric.
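
As a small illustration of this measure, the following Python sketch computes the skewness of the differenced series for a toy record with sudden rises and gradual recessions. The toy generator (random pulses fed to a linear reservoir with recession factor 0.9) is a hypothetical stand-in for a streamflow-like series and is not a model proposed in the text.

```python
import numpy as np

def skewness(z):
    """Sample coefficient of skewness (third central moment over variance^1.5)."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def differenced_skewness(x):
    """Skewness of the differenced series x[t] - x[t-1]."""
    return skewness(np.diff(x))

# Toy series: occasional pulses followed by exponential recession
rng = np.random.default_rng(0)
n = 5000
pulses = rng.exponential(1.0, n) * (rng.random(n) < 0.1)
x = np.empty(n)
state = 0.0
for t in range(n):
    state = 0.9 * state + pulses[t]
    x[t] = state

cs_forward = differenced_skewness(x)         # positive: sharp rises, slow falls
cs_reversed = differenced_skewness(x[::-1])  # same magnitude, opposite sign
```

Reversing the direction of time flips the sign of the statistic, which is why it can serve as an index of time asymmetry.
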

Spatial variation and dependence. Hydroclimatic processes evolve both in time and 
space. Typically, simulation deals only with the temporal evolution. The most precise 
mathematical representation of hydroclimatic processes can be achieved extending the 
index set of the process from one dimension (representing time) to three dimensions (one 
for time and two for space). However, multidimensional modelling is not easy and has 
been implemented only in few cases. A midway solution, which is more common in 
applications, is to use multivariate models, which describe the temporal evolution of the 
process simultaneously at a number of points. Thus, instead of having a vector index
$\boldsymbol{\tau}$ in $x_{\boldsymbol{\tau}}$, we vectorize the process state $\boldsymbol{x}_\tau$, keeping $\tau$ scalar. This vectorization type can also
be directly used to model more than one cross-correlated process (e.g. rainfall and runoff)
at the same location simultaneously. In the remainder of the chapter we will deal only with
scalar processes with scalar index set; the reader interested in multivariate or
multidimensional processes is referred to Koutsoyiannis (2000) or Koutsoyiannis et al.
(2011), respectively (see also Dimitriadis et al. 2019; Sargentis et al., 2020).


Digression 7.A: Non-conventional stochastic simulation incorporating 
deterministic models 


Deterministic models have been widely used in hydroclimatic processes. In many cases their use 
has been very effective in providing reasonable predictions, yet they suffer from the fact that they
neglect uncertainty, which is inherent in such processes.

It is possible to convert a deterministic model into a stochastic one and perform stochastic
simulation to assess the uncertainty. A relevant technique, sometimes called an ensemble method, 
is to shift from one to many applications of the deterministic model. Each simulation is performed 
after stochastically perturbing either input data, model parameters, model output, or all of them. 
In particular, perturbing the model error is done by adding random outcomes from the population 
of model errors, whose probability distribution is conditioned on input data and model 
parameters. Montanari and Koutsoyiannis (2012) have provided a blueprint of this approach 
which was further applied in a data-driven mode by Sikorska et al. (2017) and further advanced 
by Papacharalampous et al. (2020a,b). 

In a different approach, deterministic model outputs can be converted to stochastic by 
connecting a (single-run) deterministic output to a stochastic model of the process using a 
Bayesian framework. Such an approach, accompanied with hydroclimatic applications, has been 
studied by Tyralis and Koutsoyiannis (2017). 


7.2 Simple discrete-time processes of the Time Series School 


The simplest of the processes of the Time Series School have been already described in 
section 3.11 and Digression 3.E, where it has also been explained why more complex 
models of that kind are not recommended. Instead of using complex time-series models, 
it is preferable to follow the general methodology summarized in the next sections. 
Nonetheless, when there is no persistence (or antipersistence) in the process of interest, 
the simple models, which are contained in Table 7.1 along with all equations needed for 
their application, are convenient and readily applicable to simulate a process $x_\tau$ by
filtering white noise $v_\tau$ as indicated in the second column of Table 7.1.

Table 7.1 Equations of the simplest models of the Time Series School and their characteristics
(see also section 3.11 and Digression 3.E).

| Name | Process equation | Equations for second-order characteristics | Equation for marginal moments of any order | Eqn. no. |
|---|---|---|---|---|
| AR(1) | $x_\tau = a x_{\tau-1} + v_\tau$ | $c_0 (1 - a^2) = \sigma_v^2$; $c_\eta = a^{\lvert\eta\rvert} c_0$ | $(1 - a^p)\,\kappa_p = \kappa_p^{(v)}$ | (3.77)-(3.78) |
| AR(2) | $x_\tau = a_1 x_{\tau-1} + a_2 x_{\tau-2} + v_\tau$ | $c_0 = a_1 c_1 + a_2 c_2 + \sigma_v^2$; $c_1 = a_1 c_0 + a_2 c_1$; $c_\eta = a_1 c_{\eta-1} + a_2 c_{\eta-2},\ \eta \ge 2$ | | (3.79)-(3.80) |
| ARMA(1,1) | $x_\tau = a x_{\tau-1} + v_\tau + b v_{\tau-1}$ | $c_0 = a c_1 + (1 + ab + b^2)\,\sigma_v^2$; $c_1 = a c_0 + b \sigma_v^2$; $c_\eta = a^{\eta-1} c_1,\ \eta \ge 1$ | $(1 - a^p)\,\kappa_p = \left(1 - a^p + (a + b)^p\right)\kappa_p^{(v)}$ | |


All models of Table 7.1 can reproduce the marginal mean and variance of the process, 
while the AR(1) and ARMA(1,1) can also reproduce marginal moments of higher order. In 
terms of characteristics of the joint distribution, the AR(1) model can reproduce the lag-one
autocovariance, while the other two models can, additionally, reproduce the lag-two
autocovariance. The model parameters a and b can be determined by solving the
equations of the third column of Table 7.1, while the high-order moments of the
process can be preserved by specifying the high-order moments of the white noise $v_\tau$ so as to
satisfy the equations of the fourth column of Table 7.1.

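
To make the use of the AR(1) equations concrete, the sketch below derives the white-noise cumulants from target process statistics through $(1 - a^p)\kappa_p = \kappa_p^{(v)}$ and then generates a series. It is a minimal illustration, not the book's code: the function names and target values are hypothetical, the noise is assumed to follow a shifted gamma distribution (one of many possible choices), and a positive target skewness is assumed.

```python
import numpy as np

def ar1_noise_cumulants(a, kappa):
    """Cumulants of the white noise from those of the AR(1) process.

    kappa[p-1] is the p-th cumulant of x; the result applies
    kappa_p^(v) = (1 - a**p) * kappa_p for p = 1, 2, ...
    """
    return [(1 - a**p) * k for p, k in enumerate(kappa, start=1)]

def simulate_ar1(n, a, mean, var, skew, rng, burn_in=500):
    """AR(1) series preserving mean, variance, lag-one correlation and skewness.

    Assumption: skew > 0, so that a shifted gamma noise can be used.
    """
    k1, k2, k3 = ar1_noise_cumulants(a, [mean, var, skew * var**1.5])
    skew_v = k3 / k2**1.5              # skewness required from the noise
    shape = 4.0 / skew_v**2            # gamma shape giving that skewness
    g = rng.gamma(shape, 1.0, n + burn_in)
    v = k1 + np.sqrt(k2) * (g - shape) / np.sqrt(shape)
    x = np.empty(n + burn_in)
    prev = mean
    for t in range(n + burn_in):
        prev = a * prev + v[t]
        x[t] = prev
    return x[burn_in:]

rng = np.random.default_rng(42)
x = simulate_ar1(100_000, a=0.6, mean=10.0, var=4.0, skew=1.0, rng=rng)
```

Because the noise must be more skewed than the process (here by the factor $(1 - a^3)/(1 - a^2)^{3/2}$), strongly autocorrelated and strongly skewed targets can demand a very asymmetric noise distribution.
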
7.3 Generic simulation method for any stochastic structure 


To simulate the discrete-time stochastic process $x_\tau$ with any autocovariance function
$c_\eta$ we can use the generalized moving average scheme (Koutsoyiannis 2000):

$$x_\tau = \sum_{j=-J}^{J} a_j v_{\tau-j} \qquad (7.1)$$

where $a_j$ are weights to be calculated from the autocovariance function, $v_\tau$ is white noise
averaged in discrete time (and not necessarily Gaussian), also known as the innovation
process, and J is theoretically infinite, so that in all theoretical calculations we will assume
J = ∞, while in the generation J is a large integer chosen so that the resulting truncation
error is negligible. Here we stress that the above scheme is the opposite of the
schemes of the Time Series School. Specifically, (a) we use a purely moving average
scheme without any autoregressive term and (b) we do not relate our scheme with
observations, as the observations have been already used in the model fitting phase,
which is totally isolated from the generation scheme.
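
The generation step of equation (7.1) is a plain convolution of the weights with a white-noise sample. The following sketch shows one way to do it, together with the autocovariance implied by a given set of weights for unit-variance noise. The function names and the example weights $0.5^{|j|}$ are hypothetical and serve only as an illustration.

```python
import numpy as np

def generate_ma(a, n, rng, noise=None):
    """Generate n values of x as a weighted sum of white noise.

    a holds the weights a_{-J}, ..., a_J (length 2J + 1). The noise defaults
    to standard Gaussian, but any white noise sample of length n + 2J works.
    """
    a = np.asarray(a, dtype=float)
    if noise is None:
        noise = rng.standard_normal(n + len(a) - 1)
    # 'valid' keeps only the outputs for which all 2J + 1 weights are used
    return np.convolve(noise, a, mode="valid")

def implied_autocovariance(a, max_lag):
    """c_eta = sum_l a_l a_{l+eta}, for unit-variance white noise."""
    a = np.asarray(a, dtype=float)
    return np.array([np.dot(a[:len(a) - h], a[h:]) for h in range(max_lag + 1)])

J = 20
j = np.arange(-J, J + 1)
a = 0.5 ** np.abs(j)                      # hypothetical symmetric weights
c = implied_autocovariance(a, max_lag=5)  # what the scheme will reproduce

rng = np.random.default_rng(7)
x = generate_ma(a, 200_000, rng)
```

Note that no observation enters the generation: the data affect the result only through the fitted autocovariance, from which the weights are derived.
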

Writing equation (7.1) for $x_{\tau+\eta}$, multiplying it by (7.1) and taking expected values we
find the convolution expression for J = ∞:

$$c_\eta = \sum_{l=-\infty}^{\infty} a_l a_{\eta+l} \qquad (7.2)$$

We need to find the sequence of $a_\eta$, η = …, −1, 0, 1, …, so that (7.2) holds true. The
following generic solution of the generating scheme, giving the coefficients $a_\eta$, has been
proposed by Koutsoyiannis (2020a):

$$a_\eta = \int_{-1/2}^{1/2} e^{2\pi i(\vartheta(\omega) - \eta\omega)} A^{\mathrm{R}}(\omega)\, \mathrm{d}\omega \qquad (7.3)$$

where $\vartheta(\omega)$ is any (arbitrary) odd real function (meaning $\vartheta(-\omega) = -\vartheta(\omega)$) and

$$A^{\mathrm{R}}(\omega) := \sqrt{2 s_{\mathrm{d}}(\omega)} \qquad (7.4)$$

with $s_{\mathrm{d}}(\omega)$ denoting the discrete-time power spectrum of the process.
As proved in Koutsoyiannis (2020a), the sequence of $a_\eta$:


(1) consists of real numbers, despite the expression in (7.3) involving complex 
numbers; 

(2) satisfies precisely equation (7.2); and 

(3) is easy and fast to calculate using the fast Fourier transform (FFT). 


This theoretical result is readily converted into a numerical algorithm, which consists of 
the following steps: 


(a) From the continuous-time stochastic model, expressed through its climacogram
$\gamma(k)$, we calculate its autocovariance function in discrete time (assuming time step
D):

$$c_\eta = \frac{(\eta + 1)^2 \gamma(|\eta + 1| D) + (\eta - 1)^2 \gamma(|\eta - 1| D)}{2} - \eta^2 \gamma(|\eta| D) \qquad (7.5)$$

(This step is obviously omitted if the model is already expressed in discrete time
through its autocovariance function.)

(b) We choose an appropriate number of coefficients J that is a power of 2 and perform
inverse FFT (using common software) to calculate the discrete-time power
spectrum and the frequency function $A^{\mathrm{R}}(\omega)$ for an array of $\omega_j = j\omega_1$, j = 0, 1, …, J,
$\omega_1 = 1/(2J)$:

$$s_{\mathrm{d}}(\omega_j) = 2c_0 + 4\sum_{\eta=1}^{J} c_\eta \cos(2\pi\eta\omega_j), \qquad A^{\mathrm{R}}(\omega_j) = \sqrt{2 s_{\mathrm{d}}(\omega_j)} \qquad (7.6)$$

(c) We choose $\vartheta(\omega)$ (see below) and we form the arrays (vectors) $A^{\mathrm{R}}$ and $A^{\mathrm{I}}$, both of
size 2J indexed as 0, …, 2J − 1, with the superscripts R and I standing for the real
and imaginary part of a vector of complex numbers, respectively:

$$[A^{\mathrm{R}}]_j = \begin{cases} A^{\mathrm{R}}(\omega_j) \cos(2\pi\vartheta(\omega_j))/2, & j = 0, \dots, J \\ [A^{\mathrm{R}}]_{2J-j}, & j = J + 1, \dots, 2J - 1 \end{cases} \qquad (7.7)$$

$$[A^{\mathrm{I}}]_j = \begin{cases} -A^{\mathrm{R}}(\omega_j) \sin(2\pi\vartheta(\omega_j))/2, & j = 0, \dots, J - 1 \\ 0, & j = J \\ -[A^{\mathrm{I}}]_{2J-j}, & j = J + 1, \dots, 2J - 1 \end{cases} \qquad (7.8)$$

(d) We perform FFT on the vector $A^{\mathrm{R}} + \mathrm{i} A^{\mathrm{I}}$ (using common software), and get the real
part of the result for j = 0, …, J, which is precisely the sequence of $a_\eta$.
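
Step (a) can be sketched as follows. The function name is hypothetical, and the term $\eta^2\gamma(|\eta|D)$ is taken as zero at $\eta = 0$, which assumes that $k^2\gamma(k) \to 0$ as $k \to 0$. The check uses a pure Hurst-Kolmogorov climacogram $\gamma(k) = k^{2H-2}$ (unit parameters assumed), for which the autocovariance is known in closed form.

```python
import numpy as np

def autocov_from_climacogram(gamma, D, max_lag):
    """Discrete-time autocovariance c_0..c_max_lag from a climacogram gamma(k).

    Implements c_eta = [(eta+1)^2 g(|eta+1| D) + (eta-1)^2 g(|eta-1| D)] / 2
                       - eta^2 g(|eta| D), as in equation (7.5).
    """
    eta = np.arange(max_lag + 1)

    def m2_gamma(m):
        # m^2 * gamma(|m| D), set to 0 at m = 0 (assumes k^2 gamma(k) -> 0)
        m = np.abs(m).astype(float)
        out = np.zeros_like(m)
        pos = m > 0
        out[pos] = m[pos] ** 2 * gamma(m[pos] * D)
        return out

    return (m2_gamma(eta + 1) + m2_gamma(eta - 1)) / 2 - m2_gamma(eta)

H = 0.8
c_hk = autocov_from_climacogram(lambda k: k ** (2 * H - 2), D=1.0, max_lag=10)
c_white = autocov_from_climacogram(lambda k: 1.0 / k, D=1.0, max_lag=10)
```

For the white-noise climacogram $\gamma(k) = 1/k$ all autocovariances beyond lag zero vanish, as they should.
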


We note that by choosing J as a power of 2, the vectors $A^{\mathrm{R}}$ and $A^{\mathrm{I}}$ will have size 2J, which
is also a power of 2, thus achieving maximum speed in the FFT calculations. (More details
are contained in a supplementary file in Koutsoyiannis, 2020a, which includes numerical
examples along with the simple code needed to do these calculations on a spreadsheet.)
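
Steps (b) to (d) can also be sketched in Python with NumPy. The sketch is not the spreadsheet code of the cited supplementary file and rests on several assumptions that are ours: the power spectrum is evaluated by a direct cosine sum rather than a first FFT, the final transform is NumPy's unnormalized forward FFT divided by 2J, and $\vartheta$ is forced to zero at $\omega = 0$ and $\omega = 1/2$ so that these two terms stay real. The function names and the example autocovariance $0.7^{\eta}$ are hypothetical.

```python
import numpy as np

def ama_coefficients(c, theta=None):
    """Weights a_eta from autocovariances c[0..J] (J a power of 2).

    theta: optional function giving vartheta(omega) for omega in [0, 1/2];
    None means vartheta = 0 (the symmetric case).
    Returns the lags eta = -J, ..., J-1 and the corresponding weights.
    """
    c = np.asarray(c, dtype=float)
    J = len(c) - 1
    N = 2 * J
    omega = np.arange(J + 1) / N                     # omega_j = j / (2J)
    lags = np.arange(1, J + 1)
    # Step (b): power spectrum by direct cosine sum, equation (7.6)
    s_d = 2 * c[0] + 4 * np.cos(2 * np.pi * np.outer(omega, lags)) @ c[1:]
    A_R = np.sqrt(2 * np.maximum(s_d, 0.0))          # clip round-off negatives
    # Step (c): complex vector of size 2J with conjugate symmetry
    th = np.zeros(J + 1) if theta is None else np.array(theta(omega), dtype=float)
    th[0] = th[J] = 0.0     # assumption: keep the terms at omega = 0, 1/2 real
    half = 0.5 * A_R * np.exp(2j * np.pi * th)
    A = np.concatenate([half, np.conj(half[J - 1:0:-1])])
    # Step (d): forward FFT; dividing by N assumes an unnormalized transform
    a = np.fft.fft(A).real / N
    return np.arange(-J, J), np.fft.fftshift(a)

def circular_autocov(a):
    """sum_l a_l a_(l+eta), with indices wrapping around the 2J weights."""
    return np.fft.ifft(np.abs(np.fft.fft(a)) ** 2).real

J = 256
c = 0.7 ** np.arange(J + 1)               # hypothetical Markov-type autocovariance
rng = np.random.default_rng(0)
eta, a_sym = ama_coefficients(c)                                # vartheta = 0
_, a_const = ama_coefficients(c, lambda w: 0.125 * np.sign(w))  # constant 1/8
_, a_rand = ama_coefficients(c, lambda w: rng.uniform(0.0, 0.25, w.shape))
```

All three weight sequences reproduce the same autocovariance (checked here through the circular sum, which differs from the sum in (7.2) only by a wrap-around term that is negligible when the weights decay within the window), although they look entirely different as functions of the lag.
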
It may be useful to note the following additional points about the method: 


• Equation (7.3) gives not a single solution, but infinitely many solutions,
all of which preserve exactly the second-order characteristics of the process.

• A particular solution is characterized by the chosen function $\vartheta(\omega)$.

• Even assuming $\vartheta(\omega) = \vartheta_0\, \mathrm{sign}\, \omega$ with constant $\vartheta_0$, again there are infinitely many
solutions.

• The availability of infinitely many solutions enables preservation of additional
statistics (e.g. those related to time asymmetry; see section 7.5).

• In addition, we always have several options related to the distribution of the white
noise $v_\tau$, which in general is not Gaussian, thus enabling preservation of moments
of any order (see section 7.4).


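The special case $\vartheta(\omega) = 0$ gives a symmetric solution with respect to positive and
negative η:

$$A^{\mathrm{S}}(\omega) = A^{\mathrm{R}}(\omega) = \sqrt{2 s_{\mathrm{d}}(\omega)}, \qquad a_\eta = \int_{0}^{1/2} \sqrt{2 s_{\mathrm{d}}(\omega)} \cos(2\pi\eta\omega)\, \mathrm{d}\omega =: a_\eta^{\mathrm{S}} \qquad (7.9)$$

where the superscript S stands for symmetric. This has been known as the symmetric
moving average (SMA) scheme (Koutsoyiannis 2000). All other solutions denote
asymmetric moving average (AMA) schemes. An interesting special AMA case is obtained
for $\vartheta(\omega) = (1/4)\, \mathrm{sign}\, \omega$ (or $2\pi\vartheta(\omega) = (\pi/2)\, \mathrm{sign}\, \omega$). This corresponds to an antisymmetric
AMA scheme (ANTAMA) with:

$$A^{\mathrm{A}}(\omega) = A^{\mathrm{R}}(\omega)\delta(\omega) + \mathrm{i} A^{\mathrm{R}}(\omega), \qquad a_\eta = \delta_0 + \int_{0}^{1/2} \sqrt{2 s_{\mathrm{d}}(\omega)} \sin(2\pi\eta\omega)\, \mathrm{d}\omega =: a_\eta^{\mathrm{A}} \qquad (7.10)$$

where the superscript A stands for antisymmetric, $\delta(\omega)$ is the Dirac delta function, and

$$\delta_0 \approx \frac{\sqrt{2 s_{\mathrm{d}}(0)}}{2(2J + 1)} \qquad (7.11)$$

with $\delta_0$ approaching zero as J becomes large. Any other case of constant $\vartheta_0$ (where $\vartheta(\omega) =
\vartheta_0\, \mathrm{sign}\, \omega$) can be expressed in terms of the above two limiting cases through:

$$a_\eta = \delta_0 + (a_\eta^{\mathrm{S}} - \delta_0) \cos(2\pi\vartheta_0) + (a_\eta^{\mathrm{A}} - \delta_0) \sin(2\pi\vartheta_0) \qquad (7.12)$$

For example, the case $\vartheta_0 = 1/8$ (or $2\pi\vartheta_0 = \pi/4$) yields the interesting result:

$$a_\eta = \frac{\sqrt{2}}{2} (a_\eta^{\mathrm{S}} + a_\eta^{\mathrm{A}}) - (\sqrt{2} - 1)\,\delta_0 \qquad (7.13)$$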


A most common solution is the ordinary backward AMA (OBAMA) scheme in which
$a_\eta = 0$ for any η < 0; the latter is typically formulated in a different manner and denoted
simply as moving average (MA), but since here we study a richer family of schemes, we
use the distinct acronym OBAMA. A constant $\vartheta_0$ does not give a precise OBAMA and
therefore a non-constant function $\vartheta(\omega)$ is needed in this case. A generic analytical solution
of $\vartheta(\omega)$ that would give a precise OBAMA is not simple (this problem is known as factoring
of the power spectrum; see Papoulis 1991, p. 402). However, solutions for simple special
cases are not too difficult to find (e.g. for rational spectra, see Papoulis 1991, p. 402-404;
see also Koutsoyiannis (2020a) for the Markov process in continuous time, as well as for the
ARMA(1,1) process, including its special cases AR(1) and MA(1)).

However, approximate OBAMA solutions can be found rather easily. First, if for some
$\vartheta_0$ and for η < 0 it happens that $a_\eta \approx 0$, then it can be verified that:

$$a_\eta \approx \begin{cases} 0, & \eta < 0 \\ a_0^{\mathrm{S}} \cos(2\pi\vartheta_0) + (1 - \cos(2\pi\vartheta_0) - \sin(2\pi\vartheta_0))\,\delta_0, & \eta = 0 \\ \sqrt{2}\, a_\eta^{\mathrm{S}} + (2 - \sqrt{2})\,\delta_0, & \eta > 0 \end{cases} \qquad (7.14)$$

Such a sequence, with almost zero coefficients for negative η, will be close to the OBAMA
scheme. It is interesting to notice that in this approximate solution only $a_0$ depends on the
constant $\vartheta_0$, while for η > 0 the coefficients are approximately equal to those in the SMA,
multiplied by $\sqrt{2}$.

For stochastic structures with LRD, this OBAMA scheme approximation may not be 
satisfactory and a better approximation can be found by adopting a parametric expression 
for $\vartheta(\omega)$ and optimizing its parameters (see examples in Koutsoyiannis, 2020a).

The method is illustrated in Figure 7.1 using two example processes. The first is the 
Markov process, whose basic properties are shown in Table 3.5. The second is the FHK-C 
model defined in equation (3.87), which gives its climacogram, whilst all its other 
characteristics are evaluated through the equations listed in Table 3.3. Specifically, Figure 
7.1 shows three special cases, SMA (equation (7.9)), ANTAMA (equation (7.10)) and 
OBAMA for the two processes. For the OBAMA case and the Markov process the solution
plotted is exact, while for the FHK-C process the sequence of $a_\eta$ is an OBAMA
approximation. Some slight (hardly visible) deviations from zero are present in the
bottom-left panel; at a later step these will be set to zero and the small resulting effect will be
further handled as a truncation error (in the manner described by Koutsoyiannis, 2016)
to obtain an exact OBAMA scheme.

All in all, this simple method of the AMA scheme renders ARMA-type models (including
all their variants) unnecessary, particularly because of the generic, analytical and fast
solution it offers. Here it is important to stress that, while optimization of coefficients
involved in the function $\vartheta(\omega)$ could sometimes be required, it is not necessary in general.
Any odd real function $\vartheta(\omega)$, chosen arbitrarily, will give $a_\eta$ that will satisfy equation (7.2)
(apart from a truncation error) and thus can directly be used in generation. Even if the
sequence of $\vartheta(\omega_j)$ is constructed at random (e.g., as a sequence of random numbers in the
interval [0, 1/4]), again equation (7.2) will be satisfied and the resulting $a_\eta$ can be directly
used in generation.

Figure 7.1 Illustration of the symmetric (SMA), antisymmetric (ANTAMA) and ordinary-
backward (OBAMA) cases of the generic AMA model for (upper row) a Markov process and
(lower row) an FHK-C process. The parameter values are α = 10, λ = 1 (in both processes), H =
0.8, M = 0.7, and the number of weights is 2049 (J = 1024 = $2^{10}$); (left column) coefficients $a_\eta$;
(right column) autocovariance function. (Source: Koutsoyiannis, 2020a.)


Digression 7.B: A simple analytical solution for the HK model 


According to the algorithm presented in section 7.3, to calculate the series of coefficients $a_\eta$ we
need to perform the discrete Fourier transform (preferably in its FFT variant) twice, the first time
to find the power spectrum of the process $s_{\mathrm{d}}(\omega)$ and the second time to determine $a_\eta$ from the
vector $A^{\mathrm{R}} + \mathrm{i}A^{\mathrm{I}}$. Generally, these transformations are performed numerically. However, the HK
process allows analytical calculations. Specifically, there is an explicit analytical SMA solution
(Koutsoyiannis, 2016):

$$a_\eta = \sqrt{b(H)\,\gamma(D)} \left( \frac{|\eta + 1|^{H+0.5} + |\eta - 1|^{H+0.5}}{2} - |\eta|^{H+0.5} \right) \qquad (7.15)$$

where b(H) is a function of the Hurst coefficient H. For H > 0.5, the proximity of the power
spectrum of the averaged process with that of the continuous-time process (equation (3.86))
allows the theoretical derivation of a consistent expression of b(H), i.e. (Koutsoyiannis, 2016):

$$b(H) = \frac{2\,\Gamma(2H + 1) \sin(\pi H)}{\Gamma^2(H + 3/2)\,(1 + \sin(\pi H))} \qquad (7.16)$$

For H < 0.5 the proximity is not good and thus equation (7.16) does not perform well. However, a
very good approximation, valid for any H, is (Koutsoyiannis 2002, 2016):

$$b(H) \approx \frac{2(1 - H)}{(3/2 - H)^2 + 0.2\,(1/2 - H)^2} \qquad (7.17)$$


7.4 Preservation of high-order moments 


The AMA and the SMA schemes allow preserving moments of any order by the method 
outlined below. In most applications, preservation of moments up to the fourth order 
gives adequate representation of hydroclimatic processes, as illustrated in Dimitriadis 
and Koutsoyiannis (2018) and Koutsoyiannis et al. (2018). It should be stressed that in 
typical sample sizes, high order moments should be evaluated theoretically through the 
distributional parameters (see Table 2.3) rather than estimated from the data, as their 
sample estimates are unreliable (Lombardo et al. 2014). 

To deal more conveniently with moments of order > 2, we utilize the properties of
cumulants of independent variables, and particularly homogeneity and additivity. The
cumulants are directly determined from moments and vice versa (equation (2.36)). For the
pth cumulant, by virtue of (2.38), these properties result in:

$$\kappa_p = \sum_{j=-J}^{J} a_j^p\, \kappa_p^{(v)} \qquad (7.18)$$

where $\kappa_p$ and $\kappa_p^{(v)}$ are the pth cumulants of $x_\tau$ and $v_\tau$, respectively. Solving for $\kappa_p^{(v)}$ we find:

$$\kappa_p^{(v)} = \frac{\kappa_p}{\sum_{j=-J}^{J} a_j^p} \qquad (7.19)$$

Based on the above discourse, we can formulate the following steps of a general
simulation strategy, starting from the observed data (noting that alternative modelling
strategies can be seen in a series of references provided by Dimitriadis and Koutsoyiannis,
2018, and Tsoukalas et al., 2018):


1. We construct the climacogram and climacospectrum, and we choose a suitable
model of second-order dependence.

2. We fit a theoretical model on the climacogram and climacospectrum, and estimate
the Hurst parameter H, with appropriate provision for fitting issues, such as bias.

3. We estimate K-moments for q = 1 (and possibly 2), and we choose a marginal
distribution for the process based on K-moments and possibly relevant theoretical
considerations (e.g., entropy maximizing distributions), with appropriate
provision for bias.

4. Based on the model parameters (for the marginal and joint distribution) we
calculate theoretically (and not estimate from data) the classical moments of the
process of interest.

5. From equation (2.36) we calculate the cumulants of the process of interest.

6. From equation (7.19) we calculate the cumulants of the white noise process and
from (2.36) we calculate its moments.

7. We calculate the linear filtering coefficients $a_\eta$ from equations (7.5)-(7.14).

8. We choose an appropriate distribution for the white noise, calculate its parameters
theoretically from its moments and generate a random sample with the required
length.

9. Filtering with equation (7.1) we synthesize the simulated series for the process of
interest.


In the case of a tail index ξ > 0, the moments and cumulants of $x_\tau$ of order > 1/ξ will
be infinite, and hence those of $v_\tau$ will also be infinite. The moments involved in the
modelling framework and the manner they are treated are summarized in Figure 7.2.

Figure 7.2 Schematic of the moments involved in stochastic modelling and the manner they are
treated. It is assumed that we wish to preserve classical moments up to order i « n.

7.5 Preservation of time irreversibility


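Time irreversibility can very easily be handled within the AMA framework. Assuming that
the white noise $v_\tau$ (in discrete time τ) has variance 1 and coefficient of skewness $C_{\mathrm{s}}^{(v)}$, we
will have:

$$\mu_3[x_\tau] = \sum_{j=-J}^{J} a_j^3\, \mu_3[v_\tau] = C_{\mathrm{s}}^{(v)} \sum_{j=-J}^{J} a_j^3 \qquad (7.20)$$

where $\mu_3[x_\tau]$ is the third moment of the process $x_\tau$. Its coefficient of skewness will be:

$$C_{\mathrm{s}} := \frac{\mu_3[x_\tau]}{(\mathrm{var}[x_\tau])^{3/2}} = \frac{C_{\mathrm{s}}^{(v)} \sum_{j=-J}^{J} a_j^3}{\left(\sum_{j=-J}^{J} a_j^2\right)^{3/2}} \qquad (7.21)$$

Time asymmetry is quantified through the skewness of the differenced process $\tilde{x}_\tau$,
which by virtue of (7.1) is written as:

$$\tilde{x}_\tau = x_\tau - x_{\tau-1} = \sum_{j=-J}^{J} (a_j - a_{j-1})\, v_{\tau-j} \qquad (7.22)$$

with $a_{-J-1} = 0$. Thus, its skewness will be:

$$\tilde{C}_{\mathrm{s}} := \frac{\mu_3[\tilde{x}_\tau]}{(\mathrm{var}[\tilde{x}_\tau])^{3/2}} = \frac{C_{\mathrm{s}}^{(v)} \sum_{j=-J}^{J} (a_j - a_{j-1})^3}{\left(\sum_{j=-J}^{J} (a_j - a_{j-1})^2\right)^{3/2}} \qquad (7.23)$$

The ratio:

$$\frac{\tilde{C}_{\mathrm{s}}}{C_{\mathrm{s}}} = \frac{\sum_{j=-J}^{J} (a_j - a_{j-1})^3 \left(\sum_{j=-J}^{J} a_j^2\right)^{3/2}}{\left(\sum_{j=-J}^{J} (a_j - a_{j-1})^2\right)^{3/2} \sum_{j=-J}^{J} a_j^3} \qquad (7.24)$$

is independent of $C_{\mathrm{s}}^{(v)}$ and primarily depends on $\vartheta(\omega)$, which determines the sequence of
$a_\eta$. The case $\vartheta(\omega) = 0$, i.e. the SMA, results in complete time symmetry. However, a
constant $\vartheta_0 \ne 0$ (appropriately chosen) can make the ratio $\tilde{C}_{\mathrm{s}}/C_{\mathrm{s}}$ as high as we wish, thus
enabling preservation of time asymmetry.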

The above results make it clear that without skewness in the original process $x_\tau$ (e.g.
in the case of Gaussian processes), there cannot be time asymmetry.
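
The ratio of equation (7.24) depends on the weights alone and is easy to evaluate. In the sketch below the function name and the two example weight sequences are hypothetical; as an additional assumption, the sequence is padded with a zero on both sides, so that the closing difference $-a_J$ is also counted (it is negligible when the weights decay to zero).

```python
import numpy as np

def skewness_ratio(a):
    """Ratio of the skewness of the differenced process to that of the process.

    a holds the weights a_{-J}, ..., a_J. Zero padding on both sides supplies
    a_{-J-1} = 0 and, as an extra assumption, a_{J+1} = 0.
    """
    a = np.asarray(a, dtype=float)
    d = np.diff(np.concatenate(([0.0], a, [0.0])))   # differences a_j - a_{j-1}
    num = np.sum(d**3) * np.sum(a**2) ** 1.5
    den = np.sum(d**2) ** 1.5 * np.sum(a**3)
    return num / den

J = 40
j = np.arange(-J, J + 1)
a_symmetric = 0.5 ** np.abs(j)                             # symmetric weights
a_one_sided = np.where(j >= 0, 0.5 ** np.maximum(j, 0), 0.0)  # zero for j < 0

r_symmetric = skewness_ratio(a_symmetric)   # vanishes: time symmetry
r_one_sided = skewness_ratio(a_one_sided)   # 0.75 for these weights
```

For symmetric weights the differences are antisymmetric, so their cubes cancel and the ratio is zero regardless of the skewness of the white noise; any nonzero ratio therefore has to come from the asymmetry of the weights.
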


Chapter 8. Rainfall extremes and ombrian modelling 


8.1 From ombrian curves to ombrian models 


One of the major tools in hydrological design is the set of ombrian relationships, more widely
known by the misnomer rainfall intensity-duration-frequency (IDF) curves. An ombrian
relationship (from the Greek ‘όμβρος’, rainfall) is a mathematical relationship connecting
the time-averaged rainfall intensity x over a given time scale k (sometimes incorrectly
referred to as duration) for a given return period T (also commonly referred to as
frequency, although frequency is generally understood as reciprocal to period). Several
forms of ombrian relationships are found in the literature, most of which have been 
empirically derived and validated by the long use in hydrological practice. Attempts to 
give them a theoretical basis have often used inappropriate assumptions and resulted in 
oversimplified relationships that are not good for engineering application. 

Usually the ombrian curves are constructed for time scales of some minutes to several 
hours. This range of time scales has been dictated from engineering needs. However, with 
just a few adaptations of ombrian curves we can have a complete and decent stochastic 
model of rainfall, an ombrian model. The adaptations needed are basically two: an 
extension of the temporal coverage for large time scales and a more consistent theoretical 
formulation, in connection to the stochastic concepts we have already developed. From a 
practical point of view, it is not a long way nor is a big effort required to move from the 
ombrian curves to an ombrian model. And once we have the model, we directly get the 
ombrian relationships ready for engineering application. 

But do we really need a stochastic model? And if we do, why not choose one of the many 
stochastic rainfall models of the literature? An easy reply to the second question is that of 
course we can choose any available model, but a two-in-one solution, a theoretically 
consistent model and a practical tool, both in one expression, is a better choice. Coming to 
the first question, our answer is positive for two reasons: 


• While in traditional engineering design, the ombrian relationships are directly
used in calculations, current hydrosystem configurations, which are increasingly
complex, may require stochastic simulation, which is allowed by modern
computational facilities. Stochastic simulation enables determination of risk at the
end component of the hydrosystem, which actually is at risk, without relying on
common simplifying assumptions, such as the equality of probability of occurrence
of rainfall and flood discharge.

• As already discussed, estimation from data always involves bias and uncertainty,
whose determination requires a model. Both bias and uncertainty become
substantial when there is persistence. As we have already seen in several
examples, this is the case with the rainfall process and, as we will see in this
chapter, even the very common expressions of the relationship of rainfall and time
scale are suggestive of persistence.

We will now discuss the basic postulates of an ombrian model, recalling that a model
is always an approximation of reality and needs several assumptions to construct.


1. A basic desideratum is that the end result should be readily used in typical 
engineering tasks even without resorting to simulation. It should thus be as easy 
to use as traditional ombrian relationships. To this aim we could sacrifice perfect 
theoretical consistency, if this results in too involved expressions. 

2. On the other hand, a basic requirement of any stochastic model is to handle and
preserve first and second order characteristics of a process. As in ombrian
relationships the variable of interest is the temporal average intensity $x^{(k)}$ over
any time scale k, it is natural to base our model on the climacogram, i.e. $\mathrm{var}[x^{(k)}]$,
recalling from Chapter 3 that by preserving the climacogram we preserve any
other second order characteristic. The need for preserving a constant mean is self-
evident, even though as we will see (Digression 8.B), no particular interest has
been given in this requirement in common expressions of ombrian curves.

3. The process variance should be finite for k → 0 (otherwise it will not be physically
consistent; see section 2.17) and zero for k → ∞ (otherwise the process will not be
ergodic; see section 3.4).

4. The model should incorporate the fact that the probability dry, $P_0^{(k)} := P\{x^{(k)} = 0\}$,
is nonzero for small time scales. This means that the probability wet, $P_1^{(k)} =
\bar{F}^{(k)}(0) = 1 - P_0^{(k)}$, is smaller than 1 for small k, including for k → 0.

5. As the emphasis of an ombrian model is on maxima, moments of order higher than
two are important to consider.

6. In particular, the tail index of the distribution should be constant for
all time scales. Theoretical justification of this requirement is given in Appendix
8-I.

7. Because of its simplicity and explicit relationship between the time-averaged 
intensity and return period, the Pareto distribution is an optimal choice for small 
time scales; its suitability has been already verified in examples of previous 
sections. 


In Digression 8.A and Digression 8.B we see that most of these requirements are 
violated in common ombrian relationships. 


Digression 8.A: Inconsistencies of common ombrian relationships 


The most common expression of ombrian curves (in particular in fractal-oriented studies) is:

$$x = \frac{\lambda T^{\xi}}{k^{\eta}} \qquad (8.1)$$

where λ, ξ, η are parameters, all positive numbers, and ξ < η < 1. Apparently, it is not
dimensionally consistent, but this can easily be remedied by introducing the parameters α and β
with units of time and rewriting (8.1) in a dimensionally consistent manner, as:

$$x = \lambda' \frac{(T/\beta)^{\xi}}{(k/\alpha)^{\eta}} \qquad (8.2)$$

where $\lambda' := \lambda \beta^{\xi} / \alpha^{\eta}$.

According to equation (5.50), the return period $T = T^{(k)}$ is associated to time scale k and
related to the latter by $T/k = 1/\bar{F}^{(k)}(x)$. Hence:

$$F^{(k)}(x) = 1 - \frac{k}{\beta} \left(\frac{k}{\alpha}\right)^{-\eta/\xi} \left(\frac{x}{\lambda'}\right)^{-1/\xi} \qquad (8.3)$$

This is not a proper probability distribution function as for x = 0, $F^{(k)}(x) = -\infty$. Also, for k =
0, $F^{(k)}(x) = -\infty$, irrespective of x, which again is an inconsistency.
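
The inconsistency is easy to demonstrate numerically. The sketch below inverts the power-law relationship for the return period and evaluates the implied distribution function through $T/k = 1/(1 - F)$; the function names and all parameter values are hypothetical (time in arbitrary consistent units) and are chosen only so that ξ < η < 1.

```python
def return_period(x, k, lam=10.0, xi=0.15, eta=0.7, alpha=1.0, beta=1.0):
    """Return period T obtained by inverting x = lam (T/beta)^xi / (k/alpha)^eta."""
    return beta * (x * (k / alpha) ** eta / lam) ** (1.0 / xi)

def implied_cdf(x, k, **params):
    """Non-exceedance probability F = 1 - k/T implied by T/k = 1/(1 - F)."""
    return 1.0 - k / return_period(x, k, **params)

f_large_x = implied_cdf(20.0, 1.0)     # a legitimate probability
f_small_x = implied_cdf(0.5, 1.0)      # hugely negative: not a probability
f_small_k = implied_cdf(20.0, 1e-4)    # negative again at a small time scale
```

The relationship therefore behaves as a distribution function only over a limited range of intensities and time scales, which is the sense in which it cannot serve as a complete model.
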

Another ombrian relationship has been proposed by Koutsoyiannis et al. (1998) and refined
in Koutsoyiannis (2007):

$$x = \lambda' \frac{(T/D)^{\xi} - \psi'}{(1 + k/\alpha)^{\eta}} \qquad (8.4)$$

where D is a time unit, typically 1 year. At first glance this looks consistent with most of the
requirements of section 8.1. In section 8.3 we will derive it in a slightly different form as a
simplified ombrian model. As we will show, it is not free of inconsistencies, yet for small time
scales it is a good approximation of our consistent model, and can be useful in engineering
application.


8.2 Building an ombrian model 


To build a proper model in agreement with the postulates or, at least, without severe 
violations of the requirements set in section 8.1, we make the following assumptions: 


1. Pareto distribution with discontinuity at the origin for small time scales:

$$F^{(k)}(x) = 1 - P_1^{(k)} \left(1 + \xi \frac{x}{\lambda(k)}\right)^{-1/\xi} \qquad (8.5)$$

As explained in section 8.1, the tail index ξ should be constant, while the probability
wet, $P_1^{(k)}$, and the state scale parameter, λ(k), are functions of the time scale k.

2. Continuous PBF distribution with discontinuity at zero for large time scales, i-e.: 


-1/F 


In this case a new parameter ¢(k) is introduced, which is again a function of time 
scale. The Pareto distribution is a special case of (8.6) for ¢(k) = 1. In contrast to 
the Pareto distribution, whose density is a consistently decreasing function of x, 
the PBF tends to be bell-shaped for increasing ¢(k), a property consistent with 
empirical observation and reason. 


3. Constant mean of the time-averaged process:

$$ E[x^{(k)}] = \mu \tag{8.7} $$

4. Climacogram of type FHK-C (equation (3.87)), i.e.:

$$ \gamma(k) = \lambda_1 \left(1 + (k/\alpha)^{2M}\right)^{\frac{H-1}{M}} \tag{8.8} $$


or of type FHK-CD (equation (3.90)). This has six parameters in total. To avoid an
overparametrized model we set both time scale parameters equal and, as we
expect H > 0.5 due to persistence and M < 0.5 due to roughness, we set M = 1 - H,
thus getting:

$$ \mathrm{var}[x^{(k)}] = \gamma(k) = \lambda_1 \left(1 + \frac{k}{\alpha}\right)^{2H-2} + \lambda_2 \left(1 - \left(1 + \frac{\alpha}{k}\right)^{2H-2}\right) \tag{8.9} $$


Clearly, in both cases, γ(k) → 0 as k → ∞, which makes the process ergodic; for
k = 0, γ(0) = γ₀ = λ₁ in the case of (8.8) and γ(0) = γ₀ = λ₁ + λ₂ in the case of
(8.9). In both cases γ(0) is finite and the number of parameters is four.


5. Probability wet and dry, $P_0^{(k)} = 1 - P_1^{(k)}$, varying with time scale according to:

$$ \ln P_0^{(k)} = \ln P_0^{(k^*)} \, (k/k^*)^{\theta}, \qquad k \ge k^* \tag{8.10} $$

where k* is the transition time scale from Pareto to PBF distribution, for which
$P_0^{(k^*)} > 0$ and ζ(k*) = 1 (for continuity of the transition), and θ is a parameter
(0 < θ < 1). This equation has been derived in Koutsoyiannis (2006a) based on
maximum entropy considerations.* As we will see, in the Pareto distribution, the
probabilities dry and wet are derived directly from the distribution, and thus no
equation additional to (8.10) is needed. The transition time scale k* is chosen at a
point where the deviation of the probability dry derived from the Pareto model from
the empirical one is marginally acceptable.
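
Assumptions 4 and 5 translate directly into code. The following is a minimal sketch, assuming
the FHK-C form γ(k) = λ₁(1 + (k/α)^(2M))^((H-1)/M) and the probability dry law
ln P₀(k) = ln P₀(k*) (k/k*)^θ for k ≥ k*. The function names and the numerical values are
hypothetical and serve only as an illustration.

```python
import math


def climacogram_fhk_c(k, lam1, alpha, M, H):
    """Variance of the time-averaged process at scale k (FHK-C form)."""
    return lam1 * (1.0 + (k / alpha) ** (2.0 * M)) ** ((H - 1.0) / M)


def prob_dry(k, p0_star, k_star, theta):
    """Probability dry at scale k >= k_star, given its value p0_star at k_star."""
    return math.exp(math.log(p0_star) * (k / k_star) ** theta)


# Hypothetical values: lam1 in (mm/h)^2, time scales in h.
for k in (0.0, 1.0, 24.0, 24.0 * 365):
    print(k, climacogram_fhk_c(k, lam1=2.0, alpha=1.0, M=0.5, H=0.8))
print(prob_dry(48.0, p0_star=0.8, k_star=24.0, theta=0.8))
```

The climacogram starts from the finite value λ₁ at k = 0 and decays towards zero, while the
probability dry decreases from its value at k* as the scale grows.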


Both the decreasing (Pareto) and the bell-shaped (PBF) types of probability densities
are consistent with natural behaviours for small and large time scales, respectively. It
must be noted, though, that the tail index of the PBF distribution in the form of equation
(8.6) is not ξ but ξ/ζ(k), and tends to zero as k → ∞. Thus, with equation (8.6) we have
sacrificed the requirement of a constant tail index, but this violation happens only for
large time scales. The alternative of keeping (8.6) and replacing ξ with ξζ(k), thus
recovering a constant tail index ξ, is not an option because it would result in a finite
(nonvanishing) variance as k → ∞ (with a coefficient of variation ξ/√(1 - 2ξ)), i.e., in a
nonergodic process. There is also the alternative of replacing (8.6) with the distribution of
the sum of correlated Pareto variables. Actually, there is an analytical solution for this
(Arendarczyk et al., 2018, albeit not for a distribution with discontinuity at zero) but it is
quite complicated and would severely violate the basic desideratum (point 1 of section 8.1).
Therefore, we deem that the sacrifice of the constant tail index for the very large scales,
which usually are not of interest in engineering practice, is unimportant.

* A slight modification of equation (8.10), i.e.,
$\ln P_0^{(k)} = \ln P_0^{(0)} + \ln\left(P_0^{(k^*)}/P_0^{(0)}\right) (k/k^*)^{\theta}$, where $P_0^{(0)}$ is an
additional parameter representing the probability dry of the instantaneous process, with value close to 1,
can cover the entire range of scales. However, here the assumption of the Pareto distribution for small scales
renders the additional parameter superfluous.

What remains to complete the model is to determine the functions λ(k) and ζ(k)
from the mean μ and the climacogram γ(k). In Appendix 8-II we derive these functions, as
well as approximations thereof which are sufficiently good and much more practical in
application:

$$ \frac{1}{\zeta(k)} \approx \sqrt{(1 - 2\xi)\left(P_1^{(k)}\left(\frac{\gamma(k)}{\mu^2} + 1\right) - 1\right)} \tag{8.11} $$


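$$ \lambda(k) = \frac{\mu / P_1^{(k)}}{(1/\xi)^{1/\zeta(k) + 1}\; \mathrm{B}\!\left(\frac{1}{\zeta(k)} + 1,\ \frac{1}{\xi} - \frac{1}{\zeta(k)}\right)} \tag{8.12} $$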


These correspond to the PBF distribution. In the Pareto case, ζ(k) = 1, and hence
(8.11) can be used to derive the probability wet as:

$$ P_1^{(k)} = \frac{1 - \xi}{1/2 - \xi}\, \frac{\mu^2}{\gamma(k) + \mu^2} \tag{8.13} $$

while (8.12) simplifies to:

$$ \lambda(k) = \frac{(1/2 - \xi)\,(\gamma(k) + \mu^2)}{\mu} \tag{8.14} $$

Note that in the Pareto case, the equations are exact. The special case $P_1^{(k)} = 1$ signifies
the maximum time scale k_max at which the Pareto distribution is mathematically feasible,
at which:

$$ P_1^{(k_{\max})} = 1, \qquad \frac{\gamma(k_{\max})}{\mu^2} = \frac{1}{1 - 2\xi}, \qquad \lambda(k_{\max}) = \mu (1 - \xi) \tag{8.15} $$

However, if we are interested in preserving the probabilities dry/wet according to
equation (8.10), we should choose the time scale k* (of transition from Pareto to PBF)
sufficiently smaller than k_max.
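
The Pareto-case relationships can be checked by computing the moments of the mixed
distribution back from the parameters. A minimal sketch follows, assuming
P₁ = (1 - ξ)μ²/((1/2 - ξ)(γ + μ²)) and λ = (1/2 - ξ)(γ + μ²)/μ, and using the standard mean
and second moment of the Pareto distribution. The function names and the numbers are
hypothetical.

```python
def pareto_prob_wet(gamma_k, mu, xi):
    """Probability wet implied by the mean mu and the variance gamma_k at one scale."""
    return (1.0 - xi) / (0.5 - xi) * mu ** 2 / (gamma_k + mu ** 2)


def pareto_scale(gamma_k, mu, xi):
    """State scale parameter lambda(k) implied by mu and gamma_k."""
    return (0.5 - xi) * (gamma_k + mu ** 2) / mu


def pareto_moments(p_wet, lam, xi):
    """Mean and variance of the mixed law: mass 1 - p_wet at zero, Pareto for x > 0."""
    mean = p_wet * lam / (1.0 - xi)
    second = p_wet * 2.0 * lam ** 2 / ((1.0 - xi) * (1.0 - 2.0 * xi))
    return mean, second - mean ** 2


# Hypothetical hourly values: mu in mm/h, gamma_k in (mm/h)^2.
mu, gamma_k, xi = 0.1, 4.0, 0.1
p_wet = pareto_prob_wet(gamma_k, mu, xi)
lam = pareto_scale(gamma_k, mu, xi)
print(p_wet, lam, pareto_moments(p_wet, lam, xi))
```

The recovered mean and variance coincide with the inputs, and setting γ = μ²/(1 - 2ξ) gives a
probability wet of 1 with λ = μ(1 - ξ), the limiting case at k_max.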

The PBF distribution is feasible for any time scale, even when $P_1^{(k)} = 1$, which actually
is the case for large scales. In that case, equation (8.11) simplifies to:

$$ \frac{1}{\zeta(k)} \approx \frac{\sqrt{(1 - 2\xi)\,\gamma(k)}}{\mu} \tag{8.16} $$

By setting $T/k = 1/(1 - F^{(k)}(x))$, the ombrian expression resulting from equation (8.6)
is:

$$ x = \lambda(k) \left( \frac{(P_1^{(k)}\, T/k)^{\xi} - 1}{\xi} \right)^{1/\zeta(k)} \tag{8.17} $$

and for ζ(k) = 1 (Pareto), it simplifies to:

$$ x = \lambda(k)\, \frac{(P_1^{(k)}\, T/k)^{\xi} - 1}{\xi} \tag{8.18} $$

For ξ = 0 the PBF and Pareto expressions switch to Weibull and exponential, respectively,
i.e.:

$$ x = \lambda(k) \left(\ln(P_1^{(k)}\, T/k)\right)^{1/\zeta(k)}, \qquad x = \lambda(k) \ln(P_1^{(k)}\, T/k) \tag{8.19} $$


Recapitulating the above discourse, our ombrian model gives directly the ombrian
curves in the form of (8.17) and its special case (8.18) for the Pareto distribution, which
applies to small scales, the ones of greatest interest from an engineering point of view. These
relationships rely on the mean μ, the climacogram γ(k), the probability wet $P_1^{(k)}$ and the
tail index ξ of the distribution function of rainfall intensity. These relationships are
reproduced in Table 8.1.


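Table 8.1 Mathematical relationships of the ombrian model. The ombrian curves per se are given
in the last two rows.

Quantity | Small scales, k < k* (Pareto)¹ | Large scales, k > k* (PBF)¹
$E[x^{(k)}]$ | μ | μ
γ(k) | $\lambda_1 (1 + (k/\alpha)^{2M})^{(H-1)/M}$ or $\lambda_1 (1 + k/\alpha)^{2H-2} + \lambda_2 (1 - (1 + \alpha/k)^{2H-2})$ | same as for small scales
$P_1^{(k)}$ | $\frac{1 - \xi}{1/2 - \xi}\, \frac{\mu^2}{\gamma(k) + \mu^2}$ | $1 - (1 - P_1^{(k^*)})^{(k/k^*)^{\theta}}$
λ(k) | $\frac{(1/2 - \xi)(\gamma(k) + \mu^2)}{\mu}$ | $\frac{\mu / P_1^{(k)}}{(1/\xi)^{1/\zeta(k)+1}\, \mathrm{B}(1/\zeta(k) + 1,\ 1/\xi - 1/\zeta(k))}$
1/ζ(k) | 1 | $\sqrt{(1 - 2\xi)(P_1^{(k)}(\gamma(k)/\mu^2 + 1) - 1)}$
x for ξ > 0 | $\lambda(k)\, \frac{(P_1^{(k)} T/k)^{\xi} - 1}{\xi}$ | $\lambda(k) \left(\frac{(P_1^{(k)} T/k)^{\xi} - 1}{\xi}\right)^{1/\zeta(k)}$
x for ξ = 0 | $\lambda(k) \ln(P_1^{(k)} T/k)$ | $\lambda(k) \left(\ln(P_1^{(k)} T/k)\right)^{1/\zeta(k)}$

¹ The transition time scale k* is the time scale at which the empirical probability wet $\hat P_1^{(k)}$ deviates from the
expression given for the Pareto distribution; values are typically of the order of 10 to 100 h. Note that for
k ≫ k*, the probability wet becomes 1; this simplifies the relationships for the PBF distribution for very
large scales.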
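
The ombrian curves in the last two rows of the table, and their inverses, can be sketched in a
few lines. The code below assumes that λ(k), the probability wet and ζ(k) at the scale of
interest have already been evaluated; the function names and the numerical values are
hypothetical, and T and k must be expressed in the same time unit.

```python
import math


def ombrian_intensity(T, k, lam_k, p_wet_k, xi, zeta_k=1.0):
    """Intensity x with return period T at scale k; zeta_k = 1 is the Pareto case."""
    ratio = p_wet_k * T / k
    if ratio <= 1.0:
        raise ValueError("return period too short: p_wet_k * T / k must exceed 1")
    # xi = 0 gives the exponential (zeta_k = 1) or Weibull (zeta_k != 1) limit.
    base = math.log(ratio) if xi == 0.0 else (ratio ** xi - 1.0) / xi
    return lam_k * base ** (1.0 / zeta_k)


def return_period(x, k, lam_k, p_wet_k, xi, zeta_k=1.0):
    """Inverse relationship: T = k / (1 - F(x)) for x > 0."""
    u = (x / lam_k) ** zeta_k
    exceedance = p_wet_k * (math.exp(-u) if xi == 0.0 else (1.0 + xi * u) ** (-1.0 / xi))
    return k / exceedance


# Hypothetical hourly scale: k = 1 h, T = 10 years expressed in hours.
T = 10 * 8766.0
x = ombrian_intensity(T, 1.0, lam_k=3.0, p_wet_k=0.03, xi=0.1)
print(x, return_period(x, 1.0, lam_k=3.0, p_wet_k=0.03, xi=0.1) / 8766.0)
```

Treating ξ = 0 as a separate branch avoids the division by zero in the general expression.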
Our ombrian model offers:


(a) mathematical and physical consistency; 

(b) coverage of all time scales, from zero to infinity; 

(c) good behaviour on the very fine time scales, through the fractal parameter M; 

(d) good behaviour on very large time scales, through the Hurst parameter H and the
mean μ, whose effect becomes important as time scale increases;

(e) simultaneous treatment and preservation of the climacogram; and 

(f) simultaneous treatment and preservation of the probability dry/wet. 


The ombrian model uses a total of seven parameters, listed in Table 8.2. This number is
greater than that in the conventional ombrian curves, which is typically five. If the data
cannot support the estimation of seven parameters, this number can be reduced by using
default values (e.g. θ = 1, M = 0.5).

Two inequality relationships among the parameters are useful in the model fitting phase.
The first is implied by equation (8.13) and the fact that $P_1^{(k)} \le 1$; as the domain of the
Pareto distribution extends up to the transition time scale k*, the following should hold:

$$ \xi \le \frac{1}{2} - \frac{\mu^2}{2\gamma(k^*)} \tag{8.20} $$

Furthermore, in order for (8.11) to be valid, the following inequality should hold for the
entire domain of the PBF distribution, i.e. for any k ≥ k*:

$$ P_1^{(k)} \left(\frac{\gamma(k)}{\mu^2} + 1\right) > 1 \tag{8.21} $$


Table 8.2 Parameters of the ombrian model.

Parameter | Meaning of parameter | Related tool | Related equation
μ | Mean intensity | Mean, μ | (8.7)
λ₁, λ₂ | Intensity scale parameters¹ | Climacogram, γ(k) | (8.8) or (8.9)
α | Time scale parameter | Climacogram, γ(k) | (8.8) or (8.9)
M | Fractal (smoothness) parameter² | Climacogram, γ(k) | (8.8)
H | Hurst parameter | Climacogram, γ(k) | (8.8) or (8.9)
θ | Exponent of the expression of probability dry | Probability wet, $P_1^{(k)}$ | (8.10)³
ξ | Tail index | Probability distribution, F(x) | (8.5)-(8.6)

¹ One or two parameters for the cases that the climacogram is given by (8.8) or (8.9), respectively.

² The fractal (roughness/smoothness) parameter M is an independent parameter in the case that the
climacogram is given by (8.8), while if it is given by (8.9) it is assumed that M = 1 - H.

³ The expression also includes the transition time scale k*, but this is not regarded as a parameter but as a
modelling choice.


8.3 Model simplification for small time scales 


The Pareto ombrian expression in equation (8.18), which is applicable only for small
scales k < k*, can be written as:

$$ x = \lambda(k)\, \frac{(T/\beta(k))^{\xi} - 1}{\xi} \tag{8.22} $$

where β(k) is a function of time scale with units of time, i.e.:

$$ \beta(k) := \frac{k}{P_1^{(k)}} \tag{8.23} $$

By virtue of (8.14) we will have:

$$ x = \frac{(1/2 - \xi)(\gamma(k) + \mu^2)}{\xi \mu} \left( \left(\frac{T}{\beta(k)}\right)^{\xi} - 1 \right) \tag{8.24} $$

Now, we make the simplifying assumption $P_1^{(k)} \propto k$, which can stand as an
approximation for small k; hence:

$$ \beta(k) = \beta = \text{constant} \tag{8.25} $$

Then (8.24) can be written as:

$$ x = \frac{(1/2 - \xi)(\gamma(k) + \mu^2)}{\xi \mu} \left( \left(\frac{T}{\beta}\right)^{\xi} - 1 \right) \tag{8.26} $$

Further, by noting that for small time scales γ(k) ≫ μ², we can neglect the latter term in
their sum. Assuming a climacogram in the form (8.8) and taking the neutral value M = 1/2
as default, we find:

$$ x = \frac{(1/2 - \xi)\lambda_1}{\xi \mu} \left(1 + \frac{k}{\alpha}\right)^{2H-2} \left( \left(\frac{T}{\beta}\right)^{\xi} - 1 \right) \tag{8.27} $$

We can now see that, thanks to the simplifying assumption (8.25), the rainfall intensity
x is determined as the product of a function of time scale k and a function of return period T.
This facilitates calculations and particularly the parameter estimation. We can write this
property in a more concise form as:

$$ x = \lambda\, \frac{b(T)}{a(k)} \tag{8.28} $$

where we have changed the product to a quotient in order for both a(k) and b(T) to be
increasing functions of their arguments. The function a(k) is:

$$ a(k) = \left(1 + \frac{k}{\alpha}\right)^{\eta}, \qquad \eta := 2 - 2H \tag{8.29} $$

The parameter λ and the function b(T) are, for ξ > 0:

$$ \lambda = \frac{(1/2 - \xi)\lambda_1}{\xi \mu}, \qquad b(T) = \left(\frac{T}{\beta}\right)^{\xi} - 1 \tag{8.30} $$

and for ξ = 0:

$$ \lambda = \frac{\lambda_1}{2\mu}, \qquad b(T) = \ln\left(\frac{T}{\beta}\right) \tag{8.31} $$


This simplified ombrian relationship, comprised of equations (8.28)-(8.31), has five
parameters in total, falling in three categories, namely: (a) λ, with units same as x (typically
mm/h); (b) α and β, with units of time (typically in h, even though it may be convenient to
express β in years); and (c) the dimensionless ξ and η (0 < ξ < 0.5, 0 < η < 1).
A comparison of the parameters of the ombrian model and the simplified ombrian
relationship is provided in Table 8.3. Interestingly, the parameter η is related, through
equation (8.29), to the Hurst parameter, which is H = 1 - η/2. Clearly, any value of η < 1
results in H > 0.5, i.e., a process with persistence. Only the case η = 1 results in H = 0.5,
but empirical evidence does not support the value η = 1. For typical values of η = 0.5-0.7,
the resulting Hurst parameter H is 0.75-0.65. However, this is not a proper way to
estimate the Hurst parameter because equation (8.29) is an approximation good for small
scales, while the Hurst behaviour should be assessed on large time scales.
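
The separable structure makes the simplified relationship very easy to evaluate. A minimal
sketch follows, assuming x = λ b(T)/a(k) with a(k) = (1 + k/α)^η and b(T) = (T/β)^ξ - 1 (or
ln(T/β) for ξ = 0). The function names and the parameter values are hypothetical; T and β share
one time unit, and k and α share one time unit.

```python
import math


def simplified_intensity(T, k, lam, alpha, beta, xi, eta):
    """Intensity from the simplified ombrian relationship x = lam * b(T) / a(k)."""
    a = (1.0 + k / alpha) ** eta
    b = math.log(T / beta) if xi == 0.0 else (T / beta) ** xi - 1.0
    return lam * b / a


def hurst_from_eta(eta):
    """Hurst parameter implied by the exponent eta of a(k), H = 1 - eta/2."""
    return 1.0 - eta / 2.0


# Hypothetical parameters: lam in mm/h; alpha, beta, k and T in h.
params = dict(lam=100.0, alpha=0.2, beta=300.0, xi=0.15, eta=0.75)
for years in (2, 10, 100):
    print(years, simplified_intensity(years * 8766.0, 1.0, **params))
print(hurst_from_eta(0.5), hurst_from_eta(0.7))
```

As noted above, the value returned by `hurst_from_eta` is only an indication, because the
exponent η is fitted on small scales.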


Table 8.3 Comparison of the parameters of the ombrian model and the simplified ombrian
relationship.

Ombrian model: parameter | Meaning of parameter | Simplified relationship: parameter | Meaning of parameter
μ | Mean intensity | - | -
λ₁, λ₂ | Intensity scale parameters | λ | Intensity scale parameter
α | Time scale parameter for k | α | Time scale parameter for k
- | - | β | Time scale parameter for T
M | Fractal (smoothness) parameter | - | -
H | Hurst parameter | η | Exponent of the expression of the time scale function a(k)
θ | Exponent of the expression of probability dry | - | -
ξ | Tail index | ξ | Tail index





Digression 8.B: Limits of the simplified ombrian relationship 


For ξ > 0, combining equations (8.28)-(8.30) we can write:

$$ x = \lambda\, \frac{(T/\beta)^{\xi} - 1}{(1 + k/\alpha)^{\eta}} \tag{8.32} $$

By comparing it with equation (8.4), we see that the two equations are mathematically equivalent,
with the parameters α, ξ and η being identical in the two cases, and the remaining two related by:

$$ \{\beta = (\psi')^{1/\xi} D, \quad \lambda = \lambda'' \psi'\} \Leftrightarrow \{\psi' = (\beta/D)^{\xi}, \quad \lambda'' = \lambda (D/\beta)^{\xi}\} \tag{8.33} $$

As the particular form of equation (8.4) has been widespread (for example, in Greece the ombrian
curves of the entire country have been expressed in this form), equation (8.33) is useful for
conversion between the two forms in engineering application.

Solving equation (8.32) for T we find the expression of the distribution function of mean
intensity x at time scale k as:

$$ F^{(k)}(x) = 1 - \frac{k}{T} = 1 - \frac{k}{\beta} \left(1 + \frac{x}{\lambda} \left(1 + \frac{k}{\alpha}\right)^{\eta}\right)^{-1/\xi} \tag{8.34} $$

This latter indeed reflects a Pareto distribution with a discontinuity at zero, which is:

$$ P_0^{(k)} = F^{(k)}(0) = 1 - P_1^{(k)} = 1 - \frac{k}{\beta} \tag{8.35} $$

In this respect, at first glance it is consistent with respect to point 4 of section 8.1. However, for
large k, this probability may become negative, which is a mathematical inconsistency. In addition,
if k = 0, $P_0^{(k)} = F^{(k)}(0) = 1$, which means that only the value x = 0 is allowed. This is also an
inconsistency.

Furthermore, it is easy to find that its mean and squared coefficient of variation are:

$$ E[x^{(k)}] = \frac{\xi \lambda}{(1 - \xi)(1 + k/\alpha)^{\eta}}\, \frac{k}{\beta}, \qquad C_v^2[x^{(k)}] = \frac{2(1 - \xi)}{1 - 2\xi}\, \frac{\beta}{k} - 1 \tag{8.36} $$

Both these expressions signify inconsistencies with respect to points 2 and 3 of section 8.1. The
mean is clearly an increasing function of time scale, tending to infinity as k → ∞, while it should
be constant, and becoming zero if k = 0, which is absurd. The squared coefficient of variation may
become negative for large k, tending to -1 as k → ∞, which is absurd as a square of a real number
cannot be negative, and to +∞ as k → 0.

However, if we restrict k so that k ≤ β and hence the probability is reasonable ($P_1^{(k)} \le 1$), then
we can easily infer from (8.36) that $C_v^2[x^{(k)}] \ge 1/(1 - 2\xi) > 0$. In other words, the simplified
ombrian relationship has reasonable behaviour for time scales sufficiently smaller than β, even
though the constant mean condition will always be violated. Furthermore, to avoid an absurd
behaviour close to k = 0, we should also restrict k from below. A safe lower bound is the smallest
value at which data were available and were used in the construction of ombrian curves.

To get a more specific quantified view of the above, we use as an example the ombrian
relationship of Greater Athens (Kephisos River basin). This was derived by Koutsoyiannis et al.
(2010) using data of time scales from 5 min to 48 h and assuming a validity for time scales 5 min
to 100 h. For altitudes up to 200 m, the ombrian expression is:

$$ x = \frac{207\,(T^{0.15} - 0.61)}{(1 + k/0.17)^{0.77}} \tag{8.37} $$

with x in mm/h, T in years and k in h. Using equation (8.33), we can express this in the form of
(8.32) with parameters λ = 126 mm/h, α = 0.17 h, β = 325 h, ξ = 0.15 and η = 0.77. Using equations
(8.35) and (8.36), we can calculate the probability wet, the mean and the coefficient of variation.
These are plotted in Figure 8.1 for time scales covering three orders of magnitude, from 0.1 h (6
min) to 100 h. The probability wet and the coefficient of variation have a reasonable behaviour in
this range, as 100 h is much smaller than β = 325 h. The constancy of the mean is violated, but this
is not a severe limitation for this range of scales, which is dominated by the variance rather than
by the mean.
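
The conversion between the two forms and the diagnostics discussed above can be reproduced
with a short script. The sketch below starts from the parameter values quoted in the text
(λ = 126 mm/h, α = 0.17 h, β = 325 h, ξ = 0.15, η = 0.77) and assumes a year of 8766 h for the
time unit D. The function names are hypothetical, and the expressions for the probability wet,
mean and squared coefficient of variation are those of a Pareto distribution with probability
wet k/β.

```python
HOURS_PER_YEAR = 8766.0  # assumption: D = 1 year of 365.25 days, expressed in hours


def to_psi_form(lam, beta, xi, D=HOURS_PER_YEAR):
    """(lambda, beta) of the simplified relationship -> (lambda'', psi') of the older form."""
    psi = (beta / D) ** xi
    return lam / psi, psi


def diagnostics(k, lam, alpha, beta, xi, eta):
    """Probability wet, mean and squared coefficient of variation implied at scale k."""
    p_wet = k / beta
    mean = xi * lam / ((1.0 - xi) * (1.0 + k / alpha) ** eta) * p_wet
    cv2 = 2.0 * (1.0 - xi) / (1.0 - 2.0 * xi) / p_wet - 1.0
    return p_wet, mean, cv2


athens = dict(lam=126.0, alpha=0.17, beta=325.0, xi=0.15, eta=0.77)
print(to_psi_form(athens["lam"], athens["beta"], athens["xi"]))
for k in (0.1, 1.0, 10.0, 100.0):
    print(k, diagnostics(k, **athens))
```

The implied mean grows with the time scale instead of remaining constant, while the
probability wet and the coefficient of variation stay within reasonable limits up to 100 h.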

Summarizing the above discourse, the simplified ombrian relationship (equations (8.28)- 
(8.31)), can give an acceptable approximation of the ombrian relationship for a range of time 
scales of three orders of magnitude, provided that we have data in that range to fit this model. If 
we want to go to a wider range of time scales, or if we want to perform tasks other than direct 
application of the ombrian relationship—e.g. stochastic simulation—then we should use the full 
model of section 8.2. 
Figure 8.1 Mean, coefficient of variation and probability wet derived from the ombrian relationship of
Athens.


8.4 Data availability and processing 


For a reliable estimation of ombrian curves, it is important to utilize all available rainfall 
observations on all time scales. Modern rainfall measuring devices are sensors which 
readily provide digital information at small time steps (e.g. 10 min) but the older 
mechanical autographic devices should never be neglected, even though digitization of 
the archive of their recording charts is tedious. 

It has been a common practice to base the construction of the ombrian curves of a 
certain area on the data of subdaily time scale only. However, this is a problematic practice 
that leaves out important information. As first noted in Koutsoyiannis et al. (1998), the 
(usually much longer) daily rainfall observation records can be fully utilized for a more 
reliable model fitting. 

It is a strong suggestion of this text to combine and use the entire data sets of all types
of devices and, as explained in Digression 2.K, work on the parent distribution rather than
extracting values over threshold or, even worse, time-block (e.g. annual) extremes. It is
noted, though, that in some cases the availability of data is such that it does not allow access
to the full information. For example, in several cases only a few of the extreme rainfall
events have been digitized while the majority of rainfall recordings remain in charts. In
such cases adaptation of the ombrian model may be required, followed by conversion of
the final results so that they correspond to the parent distribution, which is the natural
basis for estimating design quantities. In particular, if the data used are block maxima,
then the Pareto parent distribution corresponds to an EV2 distribution of extremes.
Therefore, the model fitting should be made on the EV2, rather than the Pareto,
distribution, but the final model should be formulated for the Pareto distribution, using
the parameter values that were estimated for the EV2 distribution (see more details in section
2.19 and Digression 2.K).
Some of the model parameters are more sensitive to the availability of data of a
particular temporal resolution, as summarized in Table 8.4. Thus, the reliable estimation
of those parameters depends crucially on the availability of data at the resolution to
which they are most sensitive.


Table 8.4 Crucial sensitivity on particular temporal resolution of the parameters of the ombrian
model and the simplified ombrian relationship.

Temporal resolution | Parameters that are most sensitive to the data type
Sub-hourly | α (Time scale parameter for k), M (Fractal/smoothness parameter)
Sub-daily | θ (Exponent of the expression of probability dry), η (Exponent of the expression of the time scale function a(k))
Daily and higher | μ (Mean intensity), ξ (Tail index), H (Hurst parameter), β (Time scale parameter for T), λ₁, λ₂ (Intensity scale parameters)





Digression 8.C: Do we need a sliding window and a Hershfield coefficient?


When studying a process on multiple scales (e.g. to infer the climacogram of the process), we
aggregate the available data from several time series to different time scales. No particular
provision is made for the starting point for aggregation of each time series. To make this clearer,
let us assume that we have a daily time series $x_\tau$ and from this we construct the 2-day time series
$x_\tau^{(2)}$. Actually, we can construct two different time series $x_\tau^{(2)}$, depending on the selection we have
made for the first term. Namely, the term of $x_\tau^{(2)}$ that contains the daily term x₂ could be either
(x₁ + x₂)/2 or (x₂ + x₃)/2. Likewise, if we construct a time series at time scale 10, there are 10 variants
(the term of $x_\tau^{(10)}$ that contains the daily term x₁₀ could be any one among (x₁ + ⋯ + x₁₀)/10
through (x₁₀ + ⋯ + x₁₉)/10). All these are numerically different time series. But their statistical
characteristics are precisely the same. Since we are doing stochastics and we are interested in
the statistical characteristics (and not in time series values), all options are equivalent. Thus, the
notion of a sliding window is unnecessary for our study.

However, when studying extremes, it has been a tradition in hydrological practice (e.g. Linsley
et al., 1975, p. 357) to use a sliding window and take the maximum value among the variants. In
the above example, instead of constructing a time series $x_\tau^{(2)}$ whose first term would be, e.g.,
$x_1^{(2)} = (x_1 + x_2)/2$, we use the notion of a sliding window to construct the time series $y_\tau^{(2)}$ whose
first term is:

$$ y_1^{(2)} = \max\{(x_1 + x_2)/2,\ (x_2 + x_3)/2\} = (x_2 + \max\{x_1, x_3\})/2 \tag{8.38} $$

Let us consider the ratio of the expectations of the two cases, $E[y^{(2)}]/E[x^{(2)}]$. If we
temporarily assume a fully random process, then this ratio can be expressed in terms of
noncentral K-moments as:

$$ \frac{E[y^{(2)}]}{E[x^{(2)}]} = \frac{K'_1 + K'_2}{2K'_1} = \frac{1}{2} + \frac{K'_2}{2K'_1} \tag{8.39} $$

Thus, in an exponential distribution of the intensity, in which K′₂/K′₁ = 1.5 (see Table 6.3), we will
have a ratio equal to 1.25. As we increase the time scale k beyond 2, the corresponding ratio
$E[y^{(k)}]/E[x^{(k)}]$ will converge fast to a value slightly higher than 1.25. However, because of time
dependence this coefficient turns out to be much smaller. This decreased value, empirically
estimated from rainfall data, is usually termed the Hershfield coefficient after Hershfield and
Wilson (1957), who first studied it and noted: “It has been determined that, on the average, the
maximum rainfall in any consecutive 60-minute period is 13 percent greater than the clock-hour
rainfall for the same frequency for the corresponding period of record at most stations. Similarly,
and by coincidence, the same factor applies to daily rainfall; to convert observation-day rainfall for
a particular frequency to the maximum 1440-minute rainfall for the same frequency, multiply by
1.13.”

The value of 1.13 has been extensively used worldwide and several later studies confirmed,
rather than invalidated, it. Specifically, studies of maxima on multiple time scales have been based
on the series $y_\tau^{(k)}$, rather than $x_\tau^{(k)}$, determined as above. For the lowest available time scale, k =
1 (the time step of the original series), as this method can no longer be applied, the values $y_\tau^{(1)}$ are
calculated by multiplying $x_\tau$ by 1.13.

However, this tactic distorts the stochastic behaviour of the process $x^{(k)}$ which is to be studied.
As clarified above, the ombrian model is a stochastic model of the average intensity $x^{(k)}$ at any
time scale k. The quantity $y^{(k)}$ is something different from $x^{(k)}$ and there is no need to study it at
all. Therefore, the use of a sliding window is not recommended. A fixed time window,
with any arbitrary starting time, is what is actually needed, without any conversion of the original
time series, except taking temporal averages at several time scales. For consistency, a fixed, rather
than sliding, window should be used even in extracting block maxima.
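
The difference between the two ways of aggregating can be illustrated with synthetic data. The
sketch below generates independent, exponentially distributed values (an assumption made only
for the illustration, as real rainfall is neither independent nor exponential), builds the two
fixed-window variants at scale 2, and compares them with the sliding-window maxima. All function
names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fixed_window_means(x, k, start=0):
    """Averages over consecutive non-overlapping windows of length k, from index `start`."""
    return [sum(x[i:i + k]) / k for i in range(start, len(x) - k + 1, k)]


def sliding_pair_max(x):
    """Scale-2 sliding-window maxima: (x2 + max(x1, x3))/2, (x4 + max(x3, x5))/2, ..."""
    return [(x[i] + max(x[i - 1], x[i + 1])) / 2 for i in range(1, len(x) - 1, 2)]


random.seed(1)  # fixed seed so that the run is repeatable
x = [random.expovariate(1.0) for _ in range(200_000)]

# The two fixed-window variants are different series with practically the same mean.
print(mean(fixed_window_means(x, 2, start=0)), mean(fixed_window_means(x, 2, start=1)))

# The sliding-window series has a larger mean; theory gives a ratio of 1.25 in this case.
ratio = mean(sliding_pair_max(x)) / mean(fixed_window_means(x, 2))
print(ratio)
```

With independent exponential values the simulated ratio should lie close to the theoretical
1.25; with persistent data it would be smaller, as discussed above.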


8.5 Ombrian model fitting 


Assuming that the ombrian model parameters are known, we can determine theoretically,
based on the equations grouped together in Table 8.1, the following quantities:

• the climacogram as a function of time scale, γ(k);

• the probability wet as a function of time scale, $P_1^{(k)}$; and

• the rainfall intensity as a function of the time scale and the return period, x(k, T).


On the other hand, from the available data series, each one referring to a specific time
scale k, we can determine empirical estimates of:

• the standard climacogram estimate $\hat\gamma(k)$ using equation (4.23); and

• the probability wet, $\hat P_1^{(k)} = \hat n_1 / n$, where $\hat n_1$ is the number of nonzero observations
and n is the total number of observations in the time series of observations (where
both $\hat n_1$ and n depend on k).
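
The two empirical quantities can be computed from a single series at its native time step, as in
the following minimal sketch. Equation (4.23) is not reproduced here; the code assumes that the
standard climacogram estimate is the sample variance of the time-averaged series with divisor
n - 1, which is an assumption of this example. The function names and the short test series are
hypothetical.

```python
def aggregate(series, k):
    """Time-averaged series at scale k, using fixed non-overlapping windows."""
    return [sum(series[i:i + k]) / k for i in range(0, len(series) - k + 1, k)]


def empirical_stats(series, k):
    """Sample variance (assumed divisor n - 1) and fraction of nonzero values at scale k."""
    xk = aggregate(series, k)
    n = len(xk)
    avg = sum(xk) / n
    variance = sum((v - avg) ** 2 for v in xk) / (n - 1)
    n_wet = sum(1 for v in xk if v > 0)
    return variance, n_wet / n


rain = [0, 0, 1, 3, 0, 0, 0, 2]  # hypothetical intensities at the native time step
for k in (1, 2, 4):
    print(k, empirical_stats(rain, k))
```

As expected, the variance decreases and the probability wet increases as the scale grows.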


If we have both the model and the data series, then for each series referring to a specific
time scale k, we can make tables of empirical values of intensities, x, and corresponding
return periods, T, in two ways:


• based on K-moments, or
• based on order statistics.


For the approach based on the K-moments (Chapter 6), we can implement the
following algorithm for each specified time scale k.


1. We calculate the theoretical probability wet, $P_1^{(k)}$; from the number of observations n
in the sample we specify $n_1 = P_1^{(k)} n$ and we choose the n₁ largest values from the
time series for further processing. (In a perfect model fit, $n_1 = \hat n_1$, i.e., the observed
number of nonzero values.)

2. By adapting equation (6.94), we calculate the bias correction factor Θ. Since the
model is not a pure HK model, to estimate Θ we modify equation (6.94), neglecting
the first term of its right-hand side, which is small, and adapting the second term
as:

$$ \Theta(k, L, H) = -\frac{\gamma(L)}{2\gamma(k)} \tag{8.40} $$

This is obtained by inspection of equations (6.94), (4.24) and (3.81).

3. From the equations of Table 6.6 (entries on the Pareto and PBF distributions), given
the tail index ξ of the model, we calculate the Λ-coefficients Λ₁ and Λ∞.

4. We choose a number m of moment orders p ranging (in geometric progression)
from 1 to n₁ and for each one we estimate the noncentral K-moment $\hat K'_p$ using the
equations (6.66)-(6.68).

5. For each order p we estimate the bias-corrected order p′ from equation (6.96).

6. For each p′ we estimate the Λ-coefficient $\Lambda_{p'}$ from equation (6.112) and the return
period from (6.113), which we adapt to the following formula*:

$$ T(\hat K'_p) = \frac{k}{P_1^{(k)}}\, \Lambda_{p'}\, p' \approx \frac{k}{P_1^{(k)}} \left(\Lambda_\infty p' + (\Lambda_1 - \Lambda_\infty)\right) \tag{8.41} $$

7. To the return period $T(\hat K'_p)$ so calculated, there corresponds a value $x = \hat K'_p$;
repeating this procedure for all p we make a table of empirical values of x and
corresponding T.
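
Two pieces of this algorithm are simple enough to sketch directly: the choice of moment orders
in geometric progression and the assignment of a return period to a bias-corrected order p′
through the linear approximation Λ∞p′ + (Λ₁ - Λ∞). The sketch takes p′ as an input, since
equations (6.96) and (6.112) are not reproduced here; the function names are hypothetical. The
numbers are those of a simple check: k = 1 h, probability wet 0.03, 100 years of data and
Λ₁ = Λ∞ = 2.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days, so that 100 years = 876 600 h


def moment_orders(n1, m):
    """m orders p from 1 to n1 in geometric progression (not rounded to integers)."""
    return [n1 ** (i / (m - 1)) for i in range(m)]


def k_moment_return_period(p_adj, k, p_wet, Lambda_1, Lambda_inf):
    """Return period, in the units of k, assigned to the K-moment of corrected order p_adj."""
    return (k / p_wet) * (Lambda_inf * p_adj + (Lambda_1 - Lambda_inf))


n = 876_600                # number of hourly values in 100 years
n1 = round(0.03 * n)       # number of values retained for processing
T = k_moment_return_period(p_adj=n1, k=1.0, p_wet=0.03, Lambda_1=2.0, Lambda_inf=2.0)
print(n1, T, T / HOURS_PER_YEAR)
print(moment_orders(n1, 5))
```

Under independence (p′ = p) and with Λ∞ = 2, the largest order corresponds to a return period
of twice the record length.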


* To check the formula, let us consider k = 1 h, $P_1^{(k)} = 0.03$, and a period of observations of 100 years =
876 600 h, and assume Λ₁ = Λ∞ = 2. Then n = 876 600 and $n_1 = P_1^{(k)} n = 26\,298$; for the maximum value of
p = n₁ = 26 298, assuming independence so that p′ = p, the return period will be
$T(K'_p) = (k/P_1^{(k)})(\Lambda_\infty p' + (\Lambda_1 - \Lambda_\infty)) = (1/0.03)(2 \times 26\,298) = 1\,753\,200$ h = 200 years
(as expected, because Λ∞ = 2).

Alternatively, we can construct the required table using order statistics. In this case,
the standard procedure of assigning return periods to sample values is simpler (section
5.6) but it does not take into account the persistence, as in the case of K-moments. Here
we adapt the standard procedure to take it into account in the following quick-and-dirty
manner for each specified time scale k.


1. Using the tail index ξ of the model we calculate the coefficients A and B of equation
(5.57) and Table 5.5 (formula VI for approximation of the Pareto distribution).

2. We make a first estimate T of the return period of each nonzero value x from
equation (5.57) based on the rank i of each sample observation $x_{(i:n)}$, sorted in
ascending order; this estimate is based on the assumption of independence.

3. From equation (8.40) we calculate the bias correction factor Θ(k, L, H).

4. Based on the simplified approximation (rule of thumb) of the relationship of return
period and order of K-moment, expressed in equations (6.107)-(6.108), and
combining with equation (6.96), we estimate the adapted return period to take
persistence into account as follows:





~(k) (1+0)? 

Pay . _ 1 cee 

T' =~ min| | 204+(1 20)( Ok poo’? (8.42) 
1 

5. Repeating this procedure for all nonzero $x_{(i:n)}$, we make a table of empirical values
of x and corresponding T′.


Prior to most of the above calculations, we need to have assumed an ombrian model
and specified its parameter values. We may start from some guesses and find the final values
by minimizing the error between theoretical and estimated statistics. Such errors are
nonlinear functions of the parameters and we need a nonlinear solver to perform the
minimization. Such solvers are now common in most numerical software platforms
(including spreadsheets).

If we wish to optimize the climacogram, then we can formulate the fitting error as:

$$ E_\gamma = \sum_k w_\gamma(k) \left( \ln(\gamma(k) - \gamma(L)) - \ln \hat\gamma(k) \right)^2 \tag{8.43} $$

where L is the observation period and w_γ(k) denotes some weight, which can be chosen
as a function of k. As the climacogram spans several orders of magnitude, it is advisable
to compare logarithms rather than actual values. Furthermore, as articulated in section
4.6, due to the presence of bias, the estimate $\hat\gamma(k)$ is not comparable to the theoretical
climacogram γ(k) but rather to the theoretical expectation of its estimator, based on
equation (4.24), i.e. $E[\hat\gamma(k)] = \gamma(k) - \gamma(L)$. This explains the specific mathematical form
of equation (8.43). By minimizing E_γ, we can determine the parameters related to the
climacogram. However, the exponents θ and ξ cannot be determined from the
minimization of E_γ.

In a similar manner, we can define the fitting error in the probability wet (or dry) as:

$$ E_P = \sum_k w_P(k) \left( P_1^{(k)} - \hat P_1^{(k)} \right)^2 \tag{8.44} $$

where w_P(k) denotes some weight, which can again be chosen as a function of k. As all
parameters of the ombrian model are involved in the mathematical expression of $P_1^{(k)}$, the
minimization of E_P can determine all parameters. However, as no model is a perfect
description of reality, this type of specification of parameters, which focuses on the dry
part of the rainfall process, is not good for extreme rainfall.

For this reason, it is more advisable to fit by minimizing an error metric focusing on
distribution quantiles x(k, T) for all available time scales k and a series of return periods,
as described above. The total fitting error in this case is:

$$ E_T = \sum_k \frac{1}{n_k\, \gamma(k)} \sum_T w_T(T) \left( \hat x(k, T) - x(k, T) \right)^2 \tag{8.45} $$

where w_T(T) is a weighting factor, determined as a function of the return period T, and n_k
is the number of x values at time scale k. The total square error over the entire set of
return periods (the second sum in the right-hand side of (8.45)) is further normalized by
the climacogram γ(k).

A combined optimization would take into account all three error metrics in a linear
combination with weights a_γ, a_P, a_T, i.e.:

$$ E = a_\gamma E_\gamma + a_P E_P + a_T E_T \tag{8.46} $$

This can give a best compromise in a simple manner, even though multivariate (Pareto)
optimization would make for a more sophisticated approach.
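
The error metrics for the climacogram and the probability wet, and their linear combination,
can be sketched as plain functions to be handed to any nonlinear solver. The code below is an
illustration under stated assumptions: the model is passed as a callable of the time scale, the
empirical estimates are stored in dictionaries keyed by time scale, and unit weights are the
default. All names are hypothetical, and the quantile error is left as an input to the combined
metric.

```python
import math


def climacogram_error(gamma_model, gamma_hat, L, weight=lambda k: 1.0):
    """Sum over scales of w(k) * (ln(gamma(k) - gamma(L)) - ln(gamma_hat(k)))**2.

    gamma_model: callable k -> theoretical climacogram.
    gamma_hat: dict {time scale k: empirical climacogram estimate}.
    L: observation period, in the same units as k.
    """
    return sum(
        weight(k) * (math.log(gamma_model(k) - gamma_model(L)) - math.log(g_hat)) ** 2
        for k, g_hat in gamma_hat.items()
    )


def prob_wet_error(p_model, p_hat, weight=lambda k: 1.0):
    """Sum over scales of w(k) * (model probability wet - empirical probability wet)**2."""
    return sum(weight(k) * (p_model(k) - p) ** 2 for k, p in p_hat.items())


def combined_error(e_gamma, e_p, e_t, a_gamma=1.0, a_p=1.0, a_t=1.0):
    """Linear combination of the three error metrics."""
    return a_gamma * e_gamma + a_p * e_p + a_t * e_t


# Hypothetical model and synthetic "estimates" that match its expected value exactly.
model = lambda k: 2.0 * (1.0 + k) ** -0.4
L = 1000.0
gamma_hat = {k: model(k) - model(L) for k in (1.0, 2.0, 4.0)}
print(climacogram_error(model, gamma_hat, L))  # zero error for a perfect match
```

In an actual fit these functions would be evaluated inside the objective function of the
solver, with the model parameters as the decision variables.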


Digression 8.D: Ombrian model for Uccle, Belgium 


We will illustrate the ombrian model against data using very long observational records. We start
with the meteorological station at Uccle (a suburb of Brussels), Belgium (50.80°N, 4.37°E, 100.0
m), which is perhaps the one with the longest sub-hourly rainfall record worldwide. It belongs to the
Royal Meteorological Institute of Belgium (RMIB) and its recording started in 1898. Here the data
from 1898 to 2017 (not publicly available) with a gap in 2003 (119 years in total) have been used 
at the minimum available time step, which is 10 min, and at aggregate time scales up to 96 h. In 
addition, the daily precipitation record, publicly available through KNMI’s (Koninklijk Nederlands 
Meteorologisch Instituut) Climexp system, by accessing the European Climate Assessment & 
Dataset, has been used!. This covers the period 1880-2018 (139 years) with only very few missing 
daily values, which were left unfilled. The time scales of investigation start from the minimum 
available, i.e., daily, and advance up to 13 years (so that the aggregate time series have at least 10 
values). The sub-hourly and daily records are generally in good agreement with each other up to 
1999 but later there are notable deviations. 

We fit the model simultaneously to both sources of data. We aggregate the data of the original 
10 min time step to time scales of 0.5, 1, 2, 4, 6, 12, 24 h, and 1, 2, 4 d (11 time scales in total). Also, 
we aggregate the data of the original 1 d time step to time scales of 2, 4, 8, 16, 32, 64, 128 d, and 
0.5, 1, 2, 4, 8, 13 years (14 time scales in total). By choosing not to use time scales > 96 h for the 
sub-hourly data and also to use a larger number of time scales for the daily data, we place more 
emphasis on the latter, as daily data are generally deemed more reliable than (sub)hourly, 
particularly on the large time scales. 

Here we use the fitting approach based on the order statistics, as described in section 8.5. For 
each of the time scales of investigation we estimate from the sample the variance (climacogram) 
and probability wet and, once model parameters are assumed, the return period of each nonzero 
intensity value. The form of the theoretical climacogram we use is the FHK-C (equation (8.8)). 
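
A minimal sketch of these per-scale estimates is given below. It is only an illustration in Python, not the code used for the results reported here: the function names and the toy series are hypothetical, the aggregation uses non-overlapping blocks, and the variance is the plain (population-form) sample estimate without any adjustment for the bias induced by persistence.

```python
from statistics import pvariance

def aggregate(series, m):
    """Average a series of intensities over non-overlapping blocks of m steps."""
    n_blocks = len(series) // m          # an incomplete last block is dropped
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]

def scale_statistics(series, m):
    """Return (variance, probability wet) of the series averaged over m steps."""
    agg = aggregate(series, m)
    variance = pvariance(agg)            # empirical climacogram value at this scale
    prob_wet = sum(1 for v in agg if v > 0) / len(agg)
    return variance, prob_wet

# Toy 10-min intensities in mm/h (hypothetical values, for illustration only)
toy = [0, 0, 1.2, 0.6, 0, 0, 0, 0, 0, 3.0, 0, 0]
for m in (1, 3, 6):                      # 10 min, 0.5 h and 1 h
    print(m, scale_statistics(toy, m))
```

As the averaging scale grows, the variance of the averaged intensity decreases while the fraction of wet intervals increases, which is the behaviour the climacogram and probability-wet fittings have to reproduce.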

A first model fitting has been based on merely the climacogram, on the basis of equation (8.43), 
assuming equal weights, i.e., w_γ(k) = 1. In this case we estimated merely the four parameters that 
appear in equation (8.8). Another fitting has been based on the probability wet, on the basis of 
equation (8.44), assuming equal weights, i.e., w_P(k) = 1. In this case all seven parameters are 
estimated. However, from an engineering point of view a more useful fitting is that based on the 
total fitting error of distribution quantiles on the basis of equation (8.45). As in the approach 
based on order statistics the low return periods appear much more frequently than the high ones, 
if the weights w_T(T) are set equal, then the fitting emphasis will be given on the small return 
periods. To avoid this, we have chosen an increasing function w_T(T), namely w_T(T) ∝ √T. Finally, 
a combined fitting on the basis of equation (8.46) has been performed with weights a_γ = 0.1, a_P = 
100, a_T = 1. (Note that the chosen high value of a_P counterbalances the fact that E_P is much 
smaller than the other error components.) 
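
The role of the weights can be made concrete with a short sketch. It assumes that the quantile error of each time scale is the weighted sum of squared differences between model and empirical quantiles, divided by the number of values and by the climacogram value of that scale, which is only one reading of the description of equation (8.45). The data layout and the function names are hypothetical; the default weights of the combination are those quoted above for Uccle.

```python
import math

def quantile_error(scales, weight=math.sqrt):
    """Weighted quantile fitting error summed over time scales.

    `scales` is a list of dicts with keys 'gamma' (climacogram value),
    'T' (return periods), 'x_emp' and 'x_mod' (empirical and model quantiles).
    The normalization by len(T) * gamma is an assumption of this sketch.
    """
    total = 0.0
    for s in scales:
        sq = sum(weight(T) * (xm - xe) ** 2
                 for T, xe, xm in zip(s['T'], s['x_emp'], s['x_mod']))
        total += sq / (len(s['T']) * s['gamma'])
    return total

def combined_error(e_gamma, e_p, e_t, a_gamma=0.1, a_p=100.0, a_t=1.0):
    """Linear combination of the three error metrics, as in equation (8.46)."""
    return a_gamma * e_gamma + a_p * e_p + a_t * e_t

# Toy data for one time scale (hypothetical numbers)
toy_scales = [{'gamma': 2.0, 'T': [1.0, 4.0],
               'x_emp': [1.0, 2.0], 'x_mod': [2.0, 4.0]}]
print(quantile_error(toy_scales))                         # weights ~ sqrt(T)
print(quantile_error(toy_scales, weight=lambda T: 1.0))   # equal weights
```

With weights proportional to √T, the misfit at the larger return period contributes more to the total than it does with equal weights, which is the intended shift of emphasis.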

The fitted parameters in all cases are shown in Table 8.5. The optimization cases for quantiles 
and combined resulted in virtually the same parameter values and thus one entry appears in 
Table 8.5 for both. By inspecting the table, we see that the parameter values of μ, H, θ and ξ, 
obtained with different optimization objectives, are fairly stable (do not change much with change 
of the objective function). Notable is the high value of ξ (0.20, against typical values of 0.1-0.15) 
and the moderate H (0.6). 


Table 8.5 Parameters of the ombrian model of Uccle. 

Case of optimization [1]      μ [2] (mm/h)  λ (mm²/h²)  α (h)   M (-)  H (-)  θ (-)  ξ (-) 
Climacogram                   -             0.281       0.645   0.28   0.58   -      - 
Probability wet               0.0905        0.794       0.722   0.20   0.60   0.630  0.200 
Quantiles & Combined          0.0916        1.387       0.140   0.50   0.62   0.573  0.194 
Quantiles for subdomain [3]   0.2454        2.848       0.250   1      0.56   1      0.123 

[1] The transition time scale k* was chosen 12 h. 

[2] The mean estimate is 0.0916 mm/h for the hourly series and 0.0905 mm/h for the daily series. In the 
quantiles/combined optimization cases the value was derived by optimization. 

[3] The subdomain is defined as k < 2 d and T > 2 years. 


The empirical and theoretical climacograms are shown in Figure 8.2. The model can obtain a 
perfect climacogram fitting, if the optimization objective is the climacogram per se, but even in 
the combined optimization the fitting remains good. Likewise, as seen in Figure 8.3, the model can 
obtain a perfect fitting on the probability wet, if that is the optimization objective, but even 
in the combined optimization the fitting remains relatively good. 

The fitting on distribution quantiles for the combined optimization is shown in Figure 8.4, 
which is close to a typical depiction of ombrian relationships, except for the fact that here the time 
scales span 6 orders of magnitude (10 min = 0.17 h to 13 years = 113 958 h) and the return 
periods span almost 5 orders of magnitude. The fitting is generally good for those impressively 
wide spans of time scales and return periods and thus supports the suitability of the ombrian 
model. 

If we delimit the ranges of time scales and return periods to those used in typical ombrian 
curves, we can obtain an even better fitting. This is illustrated in Figure 8.5 for k < 2 d and T ≥ 1 
year. The ombrian relationships appear in this subdomain as straight lines in the double 
logarithmic plot of Figure 8.5, a fact that is characteristic for the tails of the Pareto distribution 
and has enticed the fans of fractals to perceive this behaviour as the magic of power laws. 

The fitting on quantiles described above has also been used in this case and the resulting 
parameter values also appear in Table 8.5. Interestingly, the parameter values differ substantially 
in this case and, in particular, the mean does not approach its standard empirical estimate, as also 
happens in common approaches of construction of ombrian curves. 

It is noted, though, that a more simplified model (section 8.3) and a simplified model fitting 
method (section 8.6) can be used if we are interested in that subdomain only. The real power of 
the full ombrian model is its coverage of all time scales and return periods, as well as its direct 
applicability in stochastic simulation. 

Figure 8.2 Fitting of the ombrian model (equation (8.8)) to the empirical estimates of the climacogram for 
Uccle. Note that the bias in the climacogram is negligible due to the low Hurst parameter. 

Figure 8.3 Fitting of the ombrian model (equations (8.10) and (8.13)) to the empirical estimates of 
probability wet (P_1) or dry (1 − P_1) for Uccle. 

Figure 8.4 Ombrian relationships resulting from the ombrian model for Uccle for time scales spanning 6 
orders of magnitude (10 min = 0.17 h to 13 years = 113 958 h). The empirical points are estimated from 
order statistics (using formula VI of Table 5.5) taking into account the effect of persistence. Continuous, 
dashed and dotted lines represent the theoretical values of the model, the empirical estimates of the daily series 
and those of the hourly series, respectively. The abbreviation “y” stands for year. 

Figure 8.5 As in Figure 8.4 but with fitting focused on time scales < 2 d and return periods ≥ 1 year. 





¹ Climexp section “blended ECA&D”, data access 2020-08-10; data time stamp 2019-05-23. 


Digression 8.E: Ombrian model for Bologna, Italy 


As already described (starting from section 1.3), Bologna, Italy, has one of the longest daily rainfall 
records worldwide, currently 206 years. Hourly rainfall data of the Bologna station are also 
available but for a much shorter period, 1990-2013, and are provided again by the Dext3r 
repository (retrieved and processed by Lombardo et al., 2019). The total length is 23 years, as the 
entire year 2008 is missing. 

Here we use the same ombrian model as in the Uccle case (Digression 8.D) except that we use 
the FHK-CD type climacogram (equation (8.9)), which is more appropriate for the more complex 
shape of the empirical climacogram, seen in Figure 8.6. We also apply the same procedure as in 
Uccle, but we also plot comparisons with quantiles estimated by K-moments. 

The parameter values for the optimization cases examined, which are the same as in Uccle, are 
shown in Table 8.6. In comparison to Uccle, the most notable difference is the very high Hurst 
parameter (≥ 0.92; this is comparable to the value 0.90 that has been already estimated for the 
annual rainfall in Bologna in section 1.3). This has visible effects in terms of high bias in the 
climacogram for scales k > 1000 h (about 40 d) in Figure 8.6. 

Figure 8.6 Fitting of the ombrian model (equation (8.9)) to the empirical estimates of the climacogram 
(upper) and climacospectrum (lower) for Bologna. The empirical estimates for time scales smaller than or 
greater than 1000 h (~42 d) are taken from the hourly and daily series, respectively. Note that the bias in 
the climacospectrum graph is negligible. 

Table 8.6 Parameters of the ombrian model of Bologna. 

Case of optimization   μ (mm/h)  λ_1 (mm²/h²)  λ_2 (mm²/h²)  α (h)   H (-)  θ (-)  ξ (-) 
Climacogram            -         0.000864      1.51          164     0.95   -      - 
Probability wet        0.0773    0.00775       0.836         14.15   0.95   0.795  0.121 
Quantiles              0.0788    0.00407       1.60          7.70    0.93   0.693  0.125 
Combined               0.0823    0.00110       1.43          8.74    0.92   0.787  0.121 

Note: The transition time scale k* was chosen 96 h (= 4 d). 


[Figure 8.7 plot: probability wet, P_1, and logarithm of probability dry, −ln(1 − P_1), versus time scale, k (h); curves: empirical, from daily series; model, optimization for probability wet/dry; model, combined optimization.] 

Figure 8.7 Fitting of the ombrian model (equations (8.10) and (8.13)) to the empirical estimates of 
probability wet (P_1) or dry (1 − P_1) of Bologna. 


As seen in Figure 8.6, the empirical and theoretical climacograms compare very well or even 
perfectly for a fitting on the basis of the climacogram. This figure also includes the 
climacospectrum of the process, where the fitting is good enough. Likewise, as seen in Figure 8.7, 
the model can obtain a perfect fitting on the probability wet, if that is the optimization objective, 
but even in the combined optimization the fitting remains relatively good. 

The fitting on distribution quantiles for the combined optimization is shown in Figure 8.8, with 
return periods assigned by order statistics (upper graph) or by K-moments (lower graph). In both 
cases adaptations to take into account the bias due to the intense HK behaviour have been 
performed. There are no noteworthy differences between the two graphs, except the smoother 
empirical curves in the K-moments graph. The fitting is good for the entire range of time scales and 
return periods, each of which spans five orders of magnitude. Again, the good fitting supports the 
suitability of the ombrian model. 

Figure 8.8 Ombrian curves resulting from the ombrian model for Bologna for time scales spanning 5 
orders of magnitude (1 h to 16 years = 140 256 h). The empirical points are estimated from order statistics 
(upper; using formula VI of Table 5.5) and K-moments (lower). In both cases the effect of persistence was 
taken into account; the ombrian model results were plotted for bias-adapted variance in order to be 
comparable with empirical plots (thus, for k > 1000 h or about 40 d, the true intensity resulting from the 
model is higher than what is shown in the graph). The abbreviation “y” stands for year. 


8.6 Simplified ombrian relationship fitting 


The procedures of section 8.5 can also be applied in the simplified version of ombrian 
relationships of section 8.3, in the form of equation (8.28). In addition, the separability of 
functions a(k) and b(T) allows a two-step approach, with each step determining the 
parameters of each of the two functions separately. The procedure was introduced by 
Koutsoyiannis et al. (1998) and is based on expressing (8.28) in the form: 


a(k)\,x = \lambda\, b(T) (8.47) 


We observe that the time scale k is not a stochastic variable (rather it takes on a set of 
values, which are chosen considering the data availability) and a(k) is a deterministic 
function thereof, while the right-hand side of the equation in essence is an expression of 
the (Pareto) distribution function, which does not depend on k. By substitution of 
equations (8.29) and (8.30), we can write (8.47) as: 


\left(1 + \frac{k}{\alpha}\right)^{\eta} x = \lambda \left(\left(\frac{T}{\beta}\right)^{\xi} - 1\right) (8.48) 


Hence, for the different time scales k_j the stochastic variables y_j = a(k_j) x = 
(1 + k_j/α)^η x have a common distribution function. Thus, the y_j for different k_j can be 
regarded as samples from the same distribution. Let y_ji = a(k_j) x_ji, where x_ji is the ith 
item of the sub-sample of size n_j corresponding to the time scale k_j, and let r_ji be its rank 
in the merged sample of all the y_ji of size n = Σ_j n_j. Let the mean rank of each sub-sample 
be r̄_j = Σ_i r_ji / n_j. We replace all r_ji of each sub-sample with its mean r̄_j and we get a sample 
of size n with n_1 values equal to r̄_1, n_2 values equal to r̄_2, etc. The estimators of its mean 
and variance* will be: 


\bar{r} = \frac{1}{n} \sum_j n_j\, \bar{r}_j, \qquad \gamma_r = \frac{1}{n} \sum_j n_j \left(\bar{r}_j - \bar{r}\right)^2 


where if there are no ties among different mean ranks, it is easy to see that r̄ = (n + 1)/2 
(a constant value as r_ji vary from 1 to n). 

Now, it is easy to understand that if the samples are from the same distribution, that in 
the right-hand side of equation (8.47), then each r̄_j should be close to the mean r̄ and the 
variance γ_r should be minimal. Furthermore, given the observations x_ji, the variance 
estimate γ_r depends on the parameters α and η. Thus, we form a minimization problem, 
seeking to find the values α and η that minimize γ_r. With current computational tools 
(even common spreadsheet software) numerical minimization of a function of two 
variables is an easy task (it can even be solved without using a solver, by a trial-and-error 
method). 
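
The first step can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the original implementation: the function names, the brute-force grid search (standing in for the trial-and-error mentioned above) and the toy data are hypothetical, a(k) is taken as (1 + k/α)^η, and ties between transformed values are not given average ranks.

```python
def rank_variance(samples, alpha, eta):
    """Variance of the sub-sample mean ranks for trial values of alpha and eta.

    `samples` maps each time scale k (h) to a list of intensities x (mm/h).
    """
    # y_ji = (1 + k_j/alpha)^eta * x_ji, tagged with the index j of its sub-sample
    merged = []
    for j, (k, xs) in enumerate(samples.items()):
        a_k = (1.0 + k / alpha) ** eta
        merged.extend((a_k * x, j) for x in xs)
    merged.sort(key=lambda item: item[0])      # ranks 1..n (ties not averaged)
    n = len(merged)
    rank_sums = [0.0] * len(samples)
    sizes = [0] * len(samples)
    for rank, (_, j) in enumerate(merged, start=1):
        rank_sums[j] += rank
        sizes[j] += 1
    grand_mean = (n + 1) / 2.0
    return sum(nj * (rs / nj - grand_mean) ** 2
               for rs, nj in zip(rank_sums, sizes)) / n

def fit_alpha_eta(samples, alphas, etas):
    """Brute-force search over a grid of trial (alpha, eta) pairs."""
    return min(((a, e) for a in alphas for e in etas),
               key=lambda p: rank_variance(samples, p[0], p[1]))

# Toy check: intensities built from known parameters and interleaved y values
true_alpha, true_eta = 0.2, 0.75
y_groups = {0.5: [1, 5, 9], 2.0: [2, 6, 7], 24.0: [3, 4, 8]}
samples = {k: [y / (1.0 + k / true_alpha) ** true_eta for y in ys]
           for k, ys in y_groups.items()}
print(rank_variance(samples, true_alpha, true_eta))   # true parameters
print(rank_variance(samples, true_alpha, 0.0))        # no scaling with k
```

In the toy data the transformed values of the three sub-samples interleave evenly under the true parameters, so the variance of the mean ranks vanishes, whereas without scaling the sub-samples separate by time scale and the variance is large. With real data the minimum is positive and the grid can be refined around it.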


* We notice that the variance resembles the Kruskal-Wallis statistic used to test whether several samples 
are from the same distribution. However, here we do not apply any test, nor would it be possible, as the test 
assumes independent samples, while clearly here they are dependent. 

We further note that the method could even work with the variables y_ji instead of their 
ranks r_ji. Nonetheless, using ranks makes the method more robust, i.e., not affected by the 
presence of outliers in the samples. 

For the sake of improving the fitting of b(T) in the region of higher intensities (and also 
to simplify the calculations) it may be preferable to use in this first step of calculations a 
part of the data values of each group instead of the complete series. For example, we can 
use the highest 1/2 or 1/3 of intensity values for each time scale (Koutsoyiannis et al., 
1998). 

Once the values of α and η are determined, we proceed to the second step of 
calculations, which is straightforward. Assuming that, with these values, all y_ji are 
from the same distribution, we merge all groups of values y_ji, thus forming a single 
sample. To finalize the task, it suffices to estimate the parameters of the Pareto 
distribution using e.g. the method of K-moments. This defines completely the form and 
the parameters of b(T). 

An advantage of the two-step method is that it allows giving different roles to different 
data sets in the fitting procedure. Thus, in the first step the parameters α (which is 
typically smaller than 1 h and needs sub-hourly data to be reliably estimated) and η (for 
which hourly or multi-hour time scales are most appropriate) should be based on 
subdaily and even sub-hourly data. In contrast, the parameters of b(T) are better deduced 
from daily raingauge data rather than from autographic rain recorder data, because the 
latter are more susceptible to measurement errors and also of shorter length. In particular, 
the tail index ξ of b(T) should ideally be based on multi-station data of the area, or be 
assumed independently of data, according to experience in the area of study. 

We readily see that equation (8.48) is none other than the simplified equation we have 
examined in Digression 8.B, i.e., 


x = \lambda\, \frac{(T/\beta)^{\xi} - 1}{(1 + k/\alpha)^{\eta}} 


It is stressed that here the return period T corresponds to the complete rainfall process 
and the fitting is made on the entire data set, or on some values over threshold. However, 
as noted in section 8.4, in some cases the availability of data does not allow access to the 
full information but only to time-block (e.g. annual) extremes. In this case, the data do not 
correspond to the parent distribution, which is the natural basis for estimating design 
quantities, but to an extreme value distribution. In the latter case the Pareto parent 
distribution reflected in equation (8.32) should be converted to the EV2 distribution of 
extremes. Therefore, the model fitting should be made on the EV2, rather than the Pareto, 
distribution. Based on the analysis of section 2.19, the corresponding ombrian 
relationship in this case is: 


x = \lambda\, \frac{\left(-(\beta/\Delta) \ln\left(1 - \Delta/T_\Delta\right)\right)^{-\xi} - 1}{(1 + k/\alpha)^{\eta}} (8.50) 


where T_Δ is the return period of the event that the value x appears as a maximum in a time 
block of length Δ, typically Δ = 1 year. While the model parameters should be fitted on 
equation (8.50), the final model should be formulated for the Pareto distribution of the 
parent process with precisely the same parameter values. 
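
The link between the two kinds of return period can be illustrated with a short sketch. It assumes the commonly used relation 1 − Δ/T_Δ = exp(−Δ/T) between the return period T of the parent process and the return period T_Δ of block maxima, which is taken here to be the relation implied by section 2.19, and it assumes the simplified ombrian relationship in the form x = λ((T/β)^ξ − 1)/(1 + k/α)^η. The function names and all parameter values are hypothetical.

```python
import math

def parent_return_period(t_block, delta=1.0):
    """Return period T of the parent process that corresponds to the return
    period t_block of the maximum over blocks of length delta (same units).
    Assumed relation: 1 - delta/t_block = exp(-delta/T)."""
    return -delta / math.log(1.0 - delta / t_block)

def ombrian_intensity(k, T, lam, beta, xi, alpha, eta):
    """Simplified ombrian relationship for time scale k (h) and return period T."""
    return lam * ((T / beta) ** xi - 1.0) / (1.0 + k / alpha) ** eta

# Hypothetical parameter values, chosen only to make the example run
params = dict(lam=20.0, beta=0.05, xi=0.15, alpha=0.2, eta=0.75)

for t_block in (2.0, 10.0, 100.0):           # return periods of annual maxima, years
    T = parent_return_period(t_block)        # corresponding parent return period
    x = ombrian_intensity(1.0, T, **params)  # 1-hour intensity, mm/h
    print(f"T_block = {t_block:5.0f} y  ->  T = {T:7.3f} y,  x = {x:6.2f} mm/h")
```

Under this assumed relation the two return periods differ noticeably only for small values (a few years); for large return periods they practically coincide, so the distinction matters mostly for frequent events.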


Digression 8.F: Tail index of rainfall intensity worldwide 


Both the full ombrian model and the simplified ombrian relationships share the same parameters 
ξ and α. The tail index ξ determines the behaviour of the distribution tail and it is the most difficult 
to estimate. In the examples of Uccle (Digression 8.D) and Bologna (Digression 8.E) the available 
data sets were quite long and supported a reliable estimate of this parameter, which was found to be 
0.194 for Uccle and 0.121 for Bologna. However, for short data records this is not possible and 
thus it is useful to refer to the global behaviour as revealed from the analyses of global data sets. 

For years, the prevailing model for rainfall extremes was the Gumbel distribution, which 
entails an exponential tail of the parent distribution, or ξ = 0. This is the smallest possible value 
for a distribution that is unbounded from above. Unjustified specification of ξ to its smallest 
possible value results in unsafe (too small) design rainfall values for large return periods. 
Recently, however, the appropriateness for rainfall of the exponential tail and the Gumbel 
distribution has been questioned. Koutsoyiannis (2004a, 2005a, 2007) discussed several 
theoretical reasons that favour the Pareto/EV2/Fréchet distribution over the 
exponential/EV1/Gumbel case. By now, several studies have provided empirical evidence 
supporting the Pareto case (ξ > 0). Some of them, based on empirical evidence from daily rainfall 
records worldwide, are explicitly mentioned below: 


1. The data set compiled by Hershfield (1961) with 95 000 station-years, which he used to 
formulate his PMP method, in the study by Koutsoyiannis (1999), was found consistent with 
the EV2 distribution with shape parameter ξ = 0.13, or slightly varying with the average 
annual maximum rainfall h (in mm) as 


\xi = 0.183 - 0.00049\, h (8.51) 


2. Koutsoyiannis (2004b, 2005a) compiled an ensemble of annual maximum daily rainfall series 
from 169 stations in the Northern Hemisphere (28 from Europe and 141 from the USA) 
roughly belonging to six major climatic zones, all having lengths from 100 to 154 years, and 
comprising a total of 18 065 station-years. The analysis provided sufficient support for the 
general applicability of a positive tail index. Furthermore, the ensemble of all samples 
supported the estimation of a unique shape parameter ξ for all stations. The estimated value 
of ξ varied for different methods of estimation and was found to be ξ = 0.09 for the maximum 
likelihood method, ξ = 0.10 for the L-moments method, ξ = 0.13 for the method of moments 
and ξ = 0.15 for a weighted least squares method. The latter method, by assuming weights 
equal to the empirical quantiles, gives higher importance to the high values and, as the 
resulting value leads to more conservative design, the value ξ = 0.15 was suggested as the 
preferred one. 

3. Papalexiou and Koutsoyiannis (2013) analysed the annual maximum daily rainfall of 15 137 
records from the GHCN daily database, with lengths varying from 40 to 163 years. Using the 
L-moments method, they fitted to all stations the GEV distribution, which comprises all three 
cases of extreme value distributions. The results clearly suggested that the EV3 distribution (a 
distribution bounded from above, with negative tail index) is completely inappropriate for 
rainfall, while the EV2/Fréchet law (ξ > 0) prevails over the EV1/Gumbel law (ξ = 0). The mean 
value of the shape parameter ξ for all stations was found to be 0.114. However, this value was 
not found to be representative for all parts of the world as there is variability. The statistical 
sampling effect explains a big part of the observed variability of the shape parameter around 
its mean value ξ = 0.114, but not the total variability. The authors concluded that the 
geographical location on the globe may affect the value of the shape parameter and constructed 
a map of the geographical distribution of the GEV shape parameter, which shows that large 
areas of the world share approximately the same GEV shape parameter. As a final remark, the 
authors suggested not to follow blindly the statistical estimate of ξ based on whatever 
statistical method. In particular, they proposed that in the case where data suggest a negative 
tail index (distribution bounded from above), this should not be used. Instead, in this case it is 
more reasonable to use a Gumbel or, for additional safety, a GEV distribution with a tail index 
value equal to 0.114. 

4. Cavanaugh et al. (2015) analysed again a subset of the GHCN daily database, selecting over 
22 000 high quality stations across the globe, which pass certain quality control and temporal 
completeness criteria. They utilized an advanced test for differentiating between exponential- 
and heavy-tailed distributions of precipitation, and their results indicated that the majority of 
precipitation exceedance probabilities are of Pareto type and, therefore, most precipitation 
records have Pareto tails, not exponential. 
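
To give a feeling for why the choice between an exponential and a Pareto tail matters in design, the sketch below compares the two quantile functions for increasing return periods. It uses the standard generalized Pareto quantile σ((T/β)^ξ − 1)/ξ, whose limit for ξ → 0 is the exponential quantile σ ln(T/β). The values σ = 1, β = 1 and ξ = 0.15 are illustrative choices (the parameterization with σ differs from the λ used in the ombrian relationships), and keeping σ fixed is a simplification, since in practice both distributions would be refitted to the same data.

```python
import math

def pareto_quantile(T, xi, sigma=1.0, beta=1.0):
    """Quantile for return period T; xi = 0 gives the exponential tail."""
    if xi == 0.0:
        return sigma * math.log(T / beta)
    return sigma * ((T / beta) ** xi - 1.0) / xi

for T in (10, 100, 1000, 10000):
    ratio = pareto_quantile(T, 0.15) / pareto_quantile(T, 0.0)
    print(f"T = {T:6d}: Pareto/exponential quantile ratio = {ratio:.2f}")
```

The ratio grows with the return period, which is the sense in which an unjustified assumption of a zero tail index becomes progressively less safe for rare events.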


Additionally, Veneziano et al. (2009) used multifractal analysis to show that the annual rainfall 
maximum for time scale d can be approximated by a GEV distribution and that typical values of ξ 
lie in the range 0.09 to 0.15 with the larger values being associated with more arid climates. 
Similar results were provided by Chaouche (2001) and Chaouche et al. (2002). Chaouche (2001) 
exploited a data base of 200 rainfall series of various time steps (month, day, hour, minute) from 
the five continents, each including more than 100 years of data. Using multifractal analyses, it was 
found that (a) a Pareto/EV2 type law describes the rainfall amounts for large return periods; (b) 
the exponent of this law is scale invariant over scales greater than an hour (in fact, this is dictated 
by theoretical reasons; see Appendix 8-I); and (c) this exponent is almost space invariant. Other 
studies have also expressed scepticism for the appropriateness of the Gumbel distribution for the 
case of rainfall extremes. Coles et al. (2003) and Coles and Pericchi (2003) concluded that 
inference based on a Gumbel distribution model fitted to the annual maxima may result in 
unrealistically high return periods for certain observed events and suggested a number of 
modifications to standard methods, among which is the replacement of the Gumbel model with 
the GEV model. Mora et al. (2005) and Bacro and Chaouche (2006) confirmed that rainfall in 
Marseille (a raingauge included in the study by Koutsoyiannis, 2004b) and other raingauges in 
southern France are not in the Gumbel law domain. Sisson et al. (2006) highlighted the fact that 
standard Gumbel analyses routinely assign near-zero probability to subsequently observed 
disasters, and that for San Juan, Puerto Rico, standard 100-year predicted rainfall estimates may 
be routinely underestimated by a factor of two. Schaefer et al. (2006) using the methodology by 
Hosking and Wallis (1997) for regional precipitation-frequency analysis and spatial mapping for 
24-hour and 2-hour time scales for Washington State, USA, found that the distribution of 
rainfall maxima in this State generally follows the EV2 distribution type. 


Digression 8.G: Area-reduction of point ombrian curves 


The statistical analysis of rainfall extremes and the construction of ombrian relationships 
typically refer to a point (i.e., the raingauge station). On the other hand, in hydrology the 
transformation of rainfall to runoff occurs at the catchment scale and thus in engineering 
applications the rainfall intensity should refer to the catchment area. This would require 
additional statistical analyses for the areally averaged rainfall intensity. However, this is usually 
too difficult or impossible, because of the sparse network of raingauges as well as synchronization 
problems among the recordings of different devices. Therefore, a common method for a 
transformation of point estimates, to account for the spatiotemporal variability of rainfall across 
the river basin, suggests applying a reduction coefficient, called the area-reduction factor (or areal 
reduction factor, ARF). 

The ARF is defined to be the ratio of the areally averaged precipitation depth over a certain 
area A for a specified return period T and time scale k to the precipitation depth over any point of 
the area (assumed to be climatically homogeneous) for the same return period and time scale. 
Accordingly, to find the ARF we need to determine the distribution functions of both areal and 
point rainfall and divide the two for several return periods and time scales. A prerequisite for this 
is to form statistical samples of areal rainfall with sufficient length and for various time scales. 
Another prerequisite for the definition to apply is the climatic homogeneity of the entire area, so 
that the same ombrian relationship applies to any point at the given area. 

Some studies miss the above definition and determine the ARF empirically, e.g. by averaging 
precipitation per event and considering the ratio of maximum point precipitation (also known as 
the centre point precipitation) to the areal precipitation; this does not make much sense. In fact, 
empirical procedures like the latter imply different empirical definitions of ARF. A comprehensive 
review of empirical procedures and alternative definitions can be found in Svensson and Jones 
(2010). Despite theoretical inconsistencies, results from empirical studies of ARF have certainly 
some usefulness. Recent studies which adopt the consistent definition have been made by 
Lombardo et al. (2006) and Overeem et al. (2010). Both of these studies use radar data to estimate 
ARF, which certainly provide a great potential for studying the spatial variability of extreme 
precipitation due to the improved spatial coverage, resulting in good indications of the spatial 
patterns of rainfall. Major improvements in ARF estimation are anticipated in the near future, as 
radar and satellite data of rainfall will become more reliable and will accumulate in time providing 
samples with lengths adequate enough to enable reliable investigation of the probability 
distribution of areal rainfall. It is noted though that the poorer quality of these data, compared to 
raingauge data, is also expected to affect ARF estimation. Indeed, Allen and DeGaetano (2005) 
found that radar-based ARF decays at a faster rate (with increasing area) than gauge-based ARF. 

Current literature typically gives ARF as a function of A and k, disregarding the effect of T, 
which is deemed small. Comprehensive investigations were carried out in the UK by NERC (1975) 
which provided tabulated values of ARF for a wide range of areas (1 to 30 000 km²) and time 
scales (1 min to 25 days). Koutsoyiannis and Xanthopoulos (1999, p. 154) fitted the following 
empirical expression to those tabulated values: 


\varphi = \max\left(0.25,\; 1 - \frac{0.048\, A^{0.36 - 0.01 \ln A}}{k^{0.35}}\right) (8.52) 


where A is given in km² and k in h. The same relationship has been compared with nomographs 
by Hershfield and Wilson (1957) for the eastern USA and by the US Weather Bureau (1960) for 
the western USA; differences are visible but not very substantial and this supports applicability 
of equation (8.52) in other parts of the world. 
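
For quick numerical use, the empirical expression can be coded directly. The sketch below assumes that equation (8.52) reads φ = max(0.25, 1 − 0.048 A^(0.36 − 0.01 ln A) / k^0.35), with A in km² and k in h; the function name is hypothetical.

```python
import math

def area_reduction_factor(area_km2, k_hours):
    """Areal reduction factor for catchment area A (km2) and time scale k (h),
    under the assumed reading of equation (8.52)."""
    exponent = 0.36 - 0.01 * math.log(area_km2)
    return max(0.25, 1.0 - 0.048 * area_km2 ** exponent / k_hours ** 0.35)

for area in (10.0, 100.0, 1000.0):
    for k in (1.0, 24.0):
        print(f"A = {area:6.0f} km2, k = {k:4.0f} h: ARF = "
              f"{area_reduction_factor(area, k):.3f}")
```

The factor decreases with increasing area and increases with increasing time scale, and the lower bound of 0.25 becomes active only for very large areas combined with very short time scales.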


Appendix 8-I: Proof that the tail index of a time-averaged process is constant at any time scale 

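We assume that the stochastic variables x and y are nonnegative (if they are not, we truncate their 
distributions) and we let z = x + y. Below, underlined symbols denote the stochastic variables and 
plain symbols their values. With the help of Figure 8.9 we can write: 

P\{\underline{z} > z\} \ge P\{\underline{x} > z\}, \qquad P\{\underline{z} > z\} \ge P\{\underline{y} > z\} (8.53) 

and 

P\{\underline{z} \le z\} \ge P\{\underline{x} \le z/2,\, \underline{y} \le z/2\} = P\{\underline{x} \le z/2 \mid \underline{y} \le z/2\}\, P\{\underline{y} \le z/2\} (8.54) 

For independent and positively dependent x and y we have 
P\{\underline{x} \le z/2 \mid \underline{y} \le z/2\} \ge P\{\underline{x} \le z/2\}. Thus, 
P\{\underline{z} \le z\} \ge P\{\underline{x} \le z/2,\, \underline{y} \le z/2\} \ge P\{\underline{x} \le z/2\}\, P\{\underline{y} \le z/2\} 
and consequently: 

P\{\underline{z} > z\} \le P\{\underline{x} > z/2\} + P\{\underline{y} > z/2\} - P\{\underline{x} > z/2\}\, P\{\underline{y} > z/2\} 
\le P\{\underline{x} > z/2\} + P\{\underline{y} > z/2\} (8.55) 

Figure 8.9 Auxiliary sketch of the proof of the constancy of tail index. 

As a result, denoting by \bar{F} the exceedance probability, e.g. \bar{F}_x(z) := P\{\underline{x} > z\}: 

\max\left(\bar{F}_x(z), \bar{F}_y(z)\right) \le \bar{F}_z(z) \le \bar{F}_x(z/2) + \bar{F}_y(z/2) (8.56) 

Multiplying by z^{1/ξ} for ξ > 0 and taking limits we obtain 

\lim_{z\to\infty} z^{1/\xi} \max\left(\bar{F}_x(z), \bar{F}_y(z)\right) \le \lim_{z\to\infty} z^{1/\xi} \bar{F}_z(z) \le \lim_{z\to\infty} z^{1/\xi} \bar{F}_x(z/2) + \lim_{z\to\infty} z^{1/\xi} \bar{F}_y(z/2) (8.57) 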


Let ξ_x and ξ_y be the tail indices of x and y, respectively, and assume ξ_x ≥ ξ_y (if not, we interchange 
x and y and have the same results). According to the definition of tail index (equation (2.63)), this 
means that: 

\lim_{x\to\infty} x^{1/\xi_x} \bar{F}_x(x) = l_x, \qquad \lim_{y\to\infty} y^{1/\xi_y} \bar{F}_y(y) = l_y (8.58) 

where l_x and l_y are finite. At the same time it means that \lim_{x\to\infty} x^{1/\xi} \bar{F}_x(x) = 0 for any ξ > ξ_x and 
\lim_{x\to\infty} x^{1/\xi} \bar{F}_x(x) = \infty for any ξ < ξ_x, and likewise for y. If we assume ξ_x > ξ_y and take ξ = ξ_x, the 
rightmost part of (8.57) becomes: 

\lim_{z\to\infty} z^{1/\xi_x} \bar{F}_x(z/2) + \lim_{z\to\infty} z^{1/\xi_x} \bar{F}_y(z/2) = \lim_{z'\to\infty} (2z')^{1/\xi_x} \bar{F}_x(z') + \lim_{z'\to\infty} (2z')^{1/\xi_x} \bar{F}_y(z') 
= 2^{1/\xi_x}\, l_x + 0 = 2^{1/\xi_x}\, l_x (8.59) 

and the leftmost part is \lim_{z\to\infty} \max\left(z^{1/\xi_x} \bar{F}_x(z),\, z^{1/\xi_x} \bar{F}_y(z)\right) = \max(l_x, 0) = l_x. Thus, (8.57) 
becomes: 

l_x \le \lim_{z\to\infty} z^{1/\xi_x} \bar{F}_z(z) \le 2^{1/\xi_x}\, l_x (8.60) 
Furthermore, for any ξ < ξ_x the leftmost quantity in (8.57) is ∞ and thus \lim_{z\to\infty} z^{1/\xi} \bar{F}_z(z) = \infty. Also, 
for any ξ > ξ_x the rightmost quantity in (8.57) is 0 and thus \lim_{z\to\infty} z^{1/\xi} \bar{F}_z(z) = 0. If ξ_x = ξ_y the 
above results are valid except that the rightmost part of (8.60) becomes 2^{1/ξ_x + 1} l_x. Summarizing 
these results, we have: 

\lim_{z\to\infty} z^{1/\xi} \bar{F}_z(z) = \begin{cases} 0, & \xi > \xi_x \\ l_z, & \xi = \xi_x \\ \infty, & \xi < \xi_x \end{cases} (8.61) 

where l_z is a finite number satisfying l_x ≤ l_z ≤ 2^{1/ξ_x + 1} l_x. This proves that the tail index of the 
distribution of the sum of two variables equals the maximum of the tail indices of the two 
variables. This result is readily expanded for many variables. Consequently, for a stationary 
stochastic process, the tail index is preserved in the cumulative process, which is a sum of many 
variables with the same tail index, and hence of the averaged process. 
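
The two bounds on the exceedance probability of the sum, on which the proof rests, can be checked on simulated data. The sketch below is an illustration only: it draws two independent Pareto samples with hypothetical tail indices 0.20 and 0.10 (the generator, the seed and the sample size are arbitrary choices) and compares the empirical exceedance frequency of the sum with the lower bound max(F̄_x(z), F̄_y(z)) and the upper bound F̄_x(z/2) + F̄_y(z/2).

```python
import random

def pareto_sample(xi, n, rng):
    """Draw n values with exceedance probability (1 + xi*x)^(-1/xi), x >= 0."""
    return [((1.0 - rng.random()) ** (-xi) - 1.0) / xi for _ in range(n)]

def exceedance(sample, level):
    """Empirical probability that a value of the sample exceeds the level."""
    return sum(1 for v in sample if v > level) / len(sample)

rng = random.Random(1)
x = pareto_sample(0.20, 20000, rng)
y = pareto_sample(0.10, 20000, rng)
z = [a + b for a, b in zip(x, y)]

for level in (5.0, 10.0, 20.0):
    lower = max(exceedance(x, level), exceedance(y, level))
    upper = exceedance(x, level / 2) + exceedance(y, level / 2)
    print(level, lower, exceedance(z, level), upper)
```

For nonnegative variables both inequalities hold value by value, so the empirical frequencies respect them for any sample; what the simulation adds is a view of how the exceedance probability of the sum stays close to that of the heavier-tailed variable at high levels.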


Appendix 8-II: Relationships of climacogram and parameters of the 
ombrian model 


By standard algebra on equation (8.6) we find that the pth moment of x“) is: 


(k) Pp 
w)?] = yt - a AW) Pa( Pot Pe : 
E[(e®)"] =o = aygoren 8 (ze - za) ee 


Hence, the mean is 
(k) 
Pr’ ak bd 1 
E[x®] =p = a (8.63) 
C(K)EMSHO) ACK) E SC) 
and the squared coefficient of variation is: 
2: vk 2 
yy _ OB (eeE- 7H 
[le (k) (a5 ie ee ak ) 
| imag 3) shy 
C(k)’e (Kk) 
For the special case that $\zeta(k) = 1$ (equation (8.5); Pareto), the mean and squared coefficient of
variation simplify to:


C3[x®] = = (8.64) 





$$E[x^{(k)}] = \mu = \frac{P_1^{(k)}\lambda(k)}{1-\xi}, \qquad C_{\mathrm v}^2[x^{(k)}] = \frac{\gamma(k)}{\mu^2} = \frac{2(1-\xi)}{(1-2\xi)\,P_1^{(k)}} - 1 \tag{8.65}$$
Combining these with equations (8.9) and (8.15) we find: 
si H-1 _M_ OM 
: va = 1-H 
Yo(1 + Uimax/ a) i ee oe (= ye) a4 (8.66) 
LU 1 26 Lu 


However, for the general case, equations (8.63) and (8.64) are implicit in $\xi$ and $\zeta(k)$, and too
complicated for our purposes. Therefore, we look for simplifying approximations. For an
approximation of (8.63) it can be seen that the part of the right-hand side that contains $\xi$ and $\zeta(k)$
is equal to $1/(1 - \xi)$ for $\zeta(k) = 1$ and tends to 1 as $\zeta(k) \to \infty$. A numerical investigation showed
that a very good approximation with these properties is the following:


ue Pack) | 1 +———. - ———— (8.67) 


GQ-A@) (cao)? 


For an approximation of (8.64), we initially examine the case $P_1^{(k)} = 1$, for which:


2 
C2[x6] = 25(4) B (eae F ~ 0m) 


» (capt Z0) 


~1=:C(¢(k)) (8.68) 
It is easily shown that for $\zeta(k) = 1$, $C(\zeta(k))$ equals $1/(1 - 2\xi)$. Furthermore, it can be shown
analytically (the proof is omitted) that, as $\zeta(k) \to \infty$ (which happens when $k \to \infty$), the LLD of $C$
as a function of $\zeta$ is $C^{\#}(\zeta(k)) = -2$. Also, numerical investigation shows that the slope of -2 is
virtually constant for the entire domain of $\zeta \ge 1$. This observation, combined with the value of
Ge [2c(Kmax)], enables the very simple approximation: 


$$C(\zeta(k)) \approx \frac{1}{(1-2\xi)\,\zeta(k)^2} \tag{8.69}$$

Hence, in the general case:

$$C_{\mathrm v}^2[x^{(k)}] = \frac{\gamma(k)}{\mu^2} = \frac{C(\zeta(k)) + 1}{P_1^{(k)}} - 1 \tag{8.70}$$
which yields: 
aoe : allie (8.71) 


P PMa-2|Gmy? A” 


Once $P_1^{(k)}$, $\gamma(k)$ and $\mu$ are known, the unknown $\zeta(k)$ and $\lambda(k)$ are easily found to be given by
equations (8.11) and (8.12).


Chapter 9. Streamflow maxima and minima 


9.1 Streamflow extremes compared to rainfall extremes 


While rainfall databases have been publicly available for a few decades, and this enabled 
the study of rainfall extremes and extraction of generalized results over the globe, 
streamflow databases with publicly available data are a more recent—and partial— 
development. Therefore, general results have not yet been obtained. The study of 
streamflow involves many more difficulties than that of rainfall. The measurement of 
streamflow is a demanding task and needs sophisticated equipment and analyses, and 
observational experience. In addition, what we measure does not necessarily represent 
the natural streamflow process. Several large-scale control structures on rivers, such as 
dams, levees, intakes and diversions, have seriously modified the natural process. 
Reconstructing the natural regime from the measurements needs appropriate knowledge 
of the modified hydrosystem and its control, and simultaneous processing of several data 
sets. The so derived reconstructed time series, usually called “naturalized”, are rare and 
usually not available online. Therefore, to understand the natural regime, it is much easier 
to analyse data from pristine catchments. 

In this chapter we provide several representative examples using long time series from 
the database of the US Geological Survey’s (USGS) National Water Information System. 
From this database, Hirsch and Ryberg (2012) selected 200 stream gauges in the 
coterminous USA, of at least 85 years length through water year 2008, from basins with 
little or no reservoir storage or urban development (less than 150 persons per km² in
2000). These stations are an ideal source of the examples given here. The data retrieved*
extend up to 2020 and are free of missing values (some “provisional” values for the most 
recent months were not used in the analyses). In Europe, a set of 224 stream gauges was 
studied in Iliopoulou et al. (2019) but in this case the time series are shorter; thus, only 
the longest of them (Po river with 90 years of observations, Montanari, 2012†) is analysed.

As we will see in the examples and contrary to rainfall, where at the lowest scales the
Pareto distribution, with lower-tail index $\zeta = 1$, is appropriate, the streamflow often
exhibits a higher value of $\zeta$ even at small scales; this implies a bell-shaped density function.
Therefore, a candidate distribution for streamflow is the PBF. While the lower-tail index
$\zeta$ is most important for the low extremes, the upper-tail index $\xi$ is most important for the
high extremes. The relationship of $\xi$ in streamflow with that of rainfall is discussed in
Digression 9.A.

Intermittence is a prominent behaviour in rainfall and has been modelled in Chapter 8
through the probability wet ($P_1$) or dry ($P_0$). This can also be the case in streamflow but
only in small streams which often, during the summer months, become dry. Large rivers
are very rarely dry, yet intermittence appears in a different mode. Specifically, there are
two states, one in which the river is fed merely by groundwater (baseflow) and one
dominated by flood. As we will see, a simple technique to model this type of intermittence
is to set a positive lower limit to the PBF distribution.

* Data retrieved on 2020-08-22 from https://nwis.waterdata.usgs.gov/nwis/inventory; the discharge
values were converted from ft³/s to m³/s.

† Data made available by Alberto Montanari, https://www.albertomontanari.it/sites/default/files/uploadedfiles/po-pontelagoscuro.txt,
retrieved on 2020-08-22. The data set is affected by the Italian practice of removing the values of
29 February in leap years.

As discussed in Chapter 8, in rainfall it is most often necessary to model extremes at 
multiple scales—whence the need to construct ombrian relationships. In streamflow this 
is not usually the case. Instead, often there is a need to construct an operational stochastic 
simulation model. Several tasks can only be studied by means of stochastic simulation. 
For example, in studying the design floods of dams, it does not suffice to determine the 
value of river discharge for the design return period. Rather, we should determine the 
value of the outflow discharge of the dam spillway. In terms of low events, again it does 
not suffice to determine the river discharge for a design return period or reliability. 
Rather, we need to establish a relationship between reservoir policy (e.g. reliable yield of 
the reservoir) and probability of emptying of reservoir. 

All these tasks are easily dealt with by stochastic simulation. A theoretical description 
of the related concepts can be found in Koutsoyiannis and Economou (2003) while a 
simplified application of a simulation methodology for the design of a reservoir spillway 
can be found in Koutsoyiannis (1994). Here we will not give detailed applications of 
simulation, as this is not the focus of this text. However, all concepts necessary for 
simulation are contained in Chapter 7. We note that in a simulation focusing on minima 
(droughts), e.g. in determining the reliable reservoir yield, a monthly time step usually 
suffices and the preservation of seasonality by means of a cyclostationary model becomes 
important (e.g. Koutsoyiannis, 2000, 2001). Instead, in a simulation focusing on maxima 
(floods), e.g. in determining a spillway’s design discharge, a subdaily simulation step is 
necessary and, in this case, it becomes important to preserve the time irreversibility of 
the streamflow (see section 7.5 and Koutsoyiannis 2019b, 2020a), which however 
becomes negligible on the monthly time scale. 

While a stochastic model is constructed for the specific time scale of interest (subdaily, 
daily, monthly or even annual, depending on the particulars of the application) a study of 
some multi-scale characteristics of the process is necessary (e.g., to characterize the time 
dependence and possibly the persistence). This is most effectively done through the 
process climacogram. In some applications (particularly in studies of large hydrosystems 
with many reservoirs) stochastic models at two or more time scales are often involved. It
thus becomes imperative to make the models consistent with each other. This is done by
techniques called model coupling or disaggregation (Koutsoyiannis and Manetas, 1996; 
Koutsoyiannis, 2001). 


Digression 9.A: Does the tail index of streamflow differ from that of rainfall? 


The tail index is particularly important for characterizing the extraordinary extreme values (e.g., 
of return periods of the order of 1000 years). Underestimation of the tail index may markedly 
underestimate the value of such events in the design phase or overestimate the return period of 
extraordinary events that have occurred. As streamflow is caused by rainfall, it is interesting to 
examine whether or not the transformation of rainfall to runoff preserves the tail index. Here we
provide some hints about what we may expect, rather than a consistent analysis of the problem,
which would not be easy.

In hydrology, we usually model this transformation within a catchment using the conceptual 
analogy of one or more connected reservoirs. Here we consider just one reservoir, which could 
be one segment in the catchment, with input $I(t)$ the upstream flow discharge and output $Q(t)$
the downstream flow discharge; alternatively, it could represent the entire catchment with the 
inflow being the precipitation. The outflow is assumed to be the reservoir spill. We wish to see 
whether or not the transformation of input to output preserves the tail index. 

Inflow and outflow are connected by the continuity equation (conservation of mass), i.e.: 


$$\frac{dS}{dt} = I - Q \tag{9.1}$$

where S is the reservoir storage. We need one more equation to fully describe the transformation 
of I to Q which we construct by means of a stage-discharge and a stage-storage relationship. We 
assume these to be of power type: 


$$Q = k z^{l}, \qquad S = S_0 + m z^{n} \tag{9.2}$$


where $z$ is the water elevation and $k, l, m, n, S_0$ are constants. From these we get $z = (Q/k)^{1/l}$ and
$S = S_0 + m(Q/k)^{n/l} = S_0 + aQ^{\beta}$, where $\beta := n/l$ and $a := m/k^{n/l}$. Eliminating $S$ from the
differential equation we obtain:

$$a\beta Q^{\beta-1}\,\frac{dQ}{dt} = I - Q \tag{9.3}$$


The value of the exponent $\beta - 1$ determines the behaviour of $Q$. To get an idea of what the value
of the exponent could be, we recall from hydraulics that a typical value of $l$ in spillways is 3/2. The
exponent $n$ is determined by the topography of the reservoir. For a prismatic reservoir, $n = 1$,
while for a pyramidal or conic one $n = 3$ and hence $\beta$ becomes 2/3 and 2 for these two cases,
respectively. Therefore, the exponent $\beta - 1$ would be -1/3 and 1, respectively. The parameter $a$
(whose dimension is such that $aQ^{\beta-1}$ has dimension of time) is also useful to interpret. A large
inundation plain would imply a large $a$ to express the fact that in a large plain $Q$ would be less
sensitive to $S$ (as for, say, $\beta = 1$, $\Delta Q = \Delta S/a$).

The differential equation (9.3) admits a closed solution only if $\beta - 1 = 0$. This case is indeed
reasonable according to the above discussion and results in a first-order linear differential
equation with general solution (see section 3.11):

$$Q(t) = Q_0\, e^{-t/a} + \frac{1}{a}\int_0^t e^{-(t-s)/a}\, I(s)\, ds \tag{9.4}$$


where $Q_0 = Q(0)$ (the initial condition). This has been very popular in hydrology and is known as
the linear catchment model. Evidently, if we multiply the input by a constant $A$, the output will
also be multiplied by the same constant. Furthermore, treating the input as a stationary stochastic
process $I(t)$ and taking a large $t$, so that $e^{-t/a} \approx 0$, we see that the output is also a stationary
stochastic process. Now, temporarily assuming that the inflow is independent in time, we see from
equation (9.4) that if for some $q$ the moment $E[I(t)^q]$ diverges to infinity, then $E[Q(t)^q]$ will
diverge too; also, if $E[I(t)^q]$ is finite, then $E[Q(t)^q]$ will be finite too. Therefore, as the inverse of
the tail index, $1/\xi$, is the threshold value determining whether the moments diverge ($q \ge 1/\xi$)
or are finite ($q < 1/\xi$), we infer that if the tail index of $I(t)$ is $\xi$, then that of $Q(t)$ will also be
$\xi$. This result can be extended to $I(t)$ dependent in time, taking into account that again any $I(t)$ is
linearly equivalent to white noise (cf. Wold's decomposition; Digression 3.E).

However, the behaviour changes if $\beta \ne 1$, as the differential equation (9.3) becomes nonlinear.
To illustrate the behaviour in this case, as there is no closed solution, we consider as an example
an input segment (surge) with mathematical form $I(t) = At\,e^{-t}$, whose total volume is $A$. We solve
the equation numerically to find $Q(t)$ for several values of $A$, and investigate the ratio of the peaks
of outflow to inflow, as the peak flows are the most representative for the behaviour in the
distribution tails.
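
A numerical experiment of this kind can be sketched as follows. This is a minimal illustration, not the computation behind Figure 9.1: it assumes S_0 = 0, an initially empty reservoir and a = 1, and it integrates the storage equation dS/dt = I - Q with Q = (S/a)^(1/β) by a fourth-order Runge-Kutta scheme. The function name, the step size and the integration horizon are hypothetical choices.

```python
import math

def peak_ratio(A, beta, a=1.0, dt=0.002, t_end=20.0):
    """Ratio of peak outflow to peak inflow for the surge I(t) = A*t*exp(-t).

    Storage is integrated from dS/dt = I - Q with Q = (S/a)**(1/beta),
    i.e. S = a*Q**beta with S_0 = 0 and an initially empty reservoir.
    """
    def inflow(t):
        return A * t * math.exp(-t)

    def outflow(S):
        return (max(S, 0.0) / a) ** (1.0 / beta)   # no outflow from an empty reservoir

    def rate(t, S):
        return inflow(t) - outflow(S)

    S, q_peak = 0.0, 0.0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        # classical fourth-order Runge-Kutta step for the storage
        k1 = rate(t, S)
        k2 = rate(t + dt / 2, S + dt * k1 / 2)
        k3 = rate(t + dt / 2, S + dt * k2 / 2)
        k4 = rate(t + dt, S + dt * k3)
        S += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        q_peak = max(q_peak, outflow(S))
    return q_peak / (A / math.e)   # the inflow peaks at t = 1 with value A/e

for beta in (2 / 3, 1.0, 2.0):
    print(beta, [round(peak_ratio(A, beta), 3) for A in (1.0, 10.0, 100.0)])
```

For β = 1 and a = 1 the closed solution (9.4) gives Q(t) = A t² e^(-t)/2, so the peak ratio is 2/e for every A, which offers a check on the integration. For β ≠ 1 the ratio changes with A, which is the behaviour discussed in the text.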

Some results have been plotted in Figure 9.1 for several values of the exponent $\beta$ and the scale
parameter $a$ of the storage-discharge relationship $S = S_0 + aQ^{\beta}$. It can be seen that, while in the
linear case the ratio $Q/I$ of peak values is constant (as expected based on the discussion of the
linear case), for $\beta \ne 1$ the results roughly support a relationship of power type, $Q/I \propto I^{\gamma}$, where
$\gamma \ne 0$ is the slope in the doubly logarithmic plot of Figure 9.1. Thus, $Q^{1/(1+\gamma)} \propto I$. If $\xi$ is the tail
index of inflow, which means that $E[I(t)^{1/\xi}] = \infty$, then $E[Q(t)^{1/((1+\gamma)\xi)}] = \infty$. Hence, we conclude
that the tail index of the outflow will be $\xi_Q = (1+\gamma)\xi$. As $\gamma$ (i.e. the slope of Figure 9.1) can be
either positive, zero or negative, $\xi_Q$ will be either greater than, equal to, or smaller than $\xi$. In
particular, the presence of large plains within the catchment will signify a decrease of $\xi$.

Figure 9.1 Ratio of peak outflow to peak inflow in a reservoir with inflow $I(t) = At\,e^{-t}$, as a function of the
inflow total volume $A$, for several values of the exponent $\beta$ and the scale parameter $a$ of the storage-discharge
relationship $S = (Q/a)^{\beta}$. Values of $\beta = 1$, $< 1$ and $> 1$ correspond to a linear reservoir, a reservoir
with roughly prismatic shape, and one with roughly pyramidal shape, respectively. A large $a$ corresponds
to a large reservoir area (e.g. the inundation of a plain).
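
The step from the slope γ to the outflow tail index can be sketched in a few lines. The example assumes that pairs of peak inflow and peak ratio are already available (here they are synthetic and follow an exact power law); the function names and the numbers are hypothetical and serve only to illustrate ξ_Q = (1 + γ)ξ.

```python
import math

def loglog_slope(inflow_peaks, ratios):
    """Least-squares slope gamma of ln(Q/I) against ln(I)."""
    xs = [math.log(v) for v in inflow_peaks]
    ys = [math.log(r) for r in ratios]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

def outflow_tail_index(xi, gamma):
    """Tail index of the outflow, xi_Q = (1 + gamma) * xi."""
    return (1.0 + gamma) * xi

# Hypothetical peak ratios that follow Q/I = 0.8 * I**(-0.25) exactly
peaks = [1.0, 10.0, 100.0, 1000.0]
ratios = [0.8 * p ** -0.25 for p in peaks]
gamma = loglog_slope(peaks, ratios)
print(gamma, outflow_tail_index(0.2, gamma))
```

With a negative slope, as in this synthetic case, the outflow tail index comes out smaller than that of the inflow.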


Of course, the whole runoff process on the entire catchment includes other types of routing in
addition to that across the flow in a river. Some of them, like snowmelt, tend to increase the tail
index, while others, such as retention and infiltration, tend to decrease it.

In conclusion, only in catchments in which an assumption of linearity ($\beta = 1$) is justified, can we
expect a tail index of streamflow equal to that of rainfall. In large catchments, which include large
flood plains, it would not be a surprise if we estimated a tail index equal to zero (a light-tailed
distribution) even if the tail index of rainfall is positive.


9.2 PBF distribution fitting on streamflow 


The extreme-oriented fitting of probability distributions has been already discussed in 
section 6.15, and implemented and further investigated in Digression 6.E. Here we will 
give a more general algorithm which can be used for fitting with emphasis either on high 
extremes or low extremes, or even on the body of the distribution. The algorithm uses 
noncentral K-moments of orders $(p, 1)$ (and for low extremes tail K-moments), with $p$
being within the feasible range $(1, n)$, where $n$ is the sample size. The steps of the
algorithm are the following.

1. We choose a number $m + 1$ of moment orders, i.e., $p_i = n^{i/m}$, $i = 0, \ldots, m$, with $p_0 = 1$, $p_m = n$. While, when dealing, for instance, with daily flows, the sample size $n$ is usually of the order of several thousands, the number $m$ could be chosen much smaller, e.g. of the order of 100, to speed up calculations without compromising accuracy. The orders $p_i$ need not be natural numbers.

2. We estimate the noncentral K-moments $K'_{p_i}$ of orders $(p_i, 1)$, $i = 0, \ldots, m$ (and for low extremes the tail K-moments $\overline{K}'_{p_i}$) using equations (6.66) (or (6.72) for tail moments) and (6.68).

3. We construct the climacogram of the time series and estimate the Hurst parameter. If the Hurst parameter is large, say $H \ge 0.8$, we replace in the following steps the orders $p_i$ with $p'_i$, where the latter are given by equations (6.94) and (6.96).

4. Using default values of the parameters $\zeta$ and $\xi$ we estimate the Λ-coefficients $\Lambda_1$ and $\Lambda_\infty$ (or $\overline{\Lambda}_1$ and $\overline{\Lambda}_\infty$ for tail moments); furthermore, for all $p_i$ (or $p'_i$ if they are different) we estimate the empirical return period $\hat{T}(K'_{p_i})$ from equation (6.113) (or $\hat{\overline{T}}(\overline{K}'_{p_i})$ from equation (6.124)).

5. Assuming default values of the scale parameter $\lambda$ and of the lower bound $x_L$, we estimate the theoretical return period of each noncentral (or tail) K-moment as:

$$\frac{T(K'_{p_i})}{D} = \left(1 + \zeta\xi\left(\frac{K'_{p_i} - x_L}{\lambda}\right)^{\zeta}\right)^{\frac{1}{\zeta\xi}}, \qquad \frac{\overline{T}(\overline{K}'_{p_i})}{D} = \frac{1}{1 - \left(1 + \zeta\xi\left(\frac{\overline{K}'_{p_i} - x_L}{\lambda}\right)^{\zeta}\right)^{-\frac{1}{\zeta\xi}}} \tag{9.5}$$

6. We form an expression for the total fitting error as the sum of the logarithmic deviations of empirical and theoretical return periods, i.e.:

$$E(\zeta,\xi,\lambda,x_L) = \sum_{i=0}^{m} w_T \left(\ln \hat{T}(K'_{p_i}) - \ln T(K'_{p_i})\right)^2, \qquad \overline{E}(\zeta,\xi,\lambda,x_L) = \sum_{i=0}^{m} \overline{w}_T \left(\ln \hat{\overline{T}}(\overline{K}'_{p_i}) - \ln \overline{T}(\overline{K}'_{p_i})\right)^2 \tag{9.6}$$

where $w_T$ and $\overline{w}_T$ denote weighting coefficients. These errors are functions of the chosen parameters $(\zeta, \xi, \lambda, x_L)$ and we evaluate them for the chosen parameter set.

7. We repeat the calculations of steps 4-6 for different sets of parameters $(\zeta, \xi, \lambda, x_L)$ until the fitting error becomes minimal.

The repetitions of steps 4-6 are executed by a solver (such as those in spreadsheet
software) using as objective function (to be minimized) the error $E(\zeta,\xi,\lambda,x_L)$ (or
$\overline{E}(\zeta,\xi,\lambda,x_L)$ in the case of tail moments). The procedure is typically very fast, almost
instant.
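
Steps 1, 5 and 6 can be sketched in code under stated assumptions. The K-moment estimators and the Λ-coefficients of Chapter 6 are not reproduced here, so the sketch takes the K-moment values and their empirical return periods as given inputs. It also assumes the PBF exceedance probability (1 + ζξ((x - x_L)/λ)^ζ)^(-1/(ζξ)) with ξ > 0, and treats D as the time unit in which return periods are expressed. All function names and the numerical values are hypothetical.

```python
import math

def moment_orders(n, m):
    """Orders p_i = n**(i/m), i = 0..m, so that p_0 = 1 and p_m = n."""
    return [n ** (i / m) for i in range(m + 1)]

def pbf_return_period(x, zeta, xi, lam, x_low, D=1.0):
    """Theoretical return period of x with reference to maxima (assumed PBF form, xi > 0)."""
    u = 1.0 + zeta * xi * ((x - x_low) / lam) ** zeta
    return D * u ** (1.0 / (zeta * xi))

def pbf_return_period_low(x, zeta, xi, lam, x_low, D=1.0):
    """Theoretical return period of x with reference to minima (same assumed form)."""
    u = 1.0 + zeta * xi * ((x - x_low) / lam) ** zeta
    return D / (1.0 - u ** (-1.0 / (zeta * xi)))

def fitting_error(k_values, t_empirical, params, weights=None, low=False):
    """Weighted sum of squared log deviations of empirical and theoretical return periods."""
    zeta, xi, lam, x_low = params
    theory = pbf_return_period_low if low else pbf_return_period
    if weights is None:
        weights = [1.0] * len(k_values)          # default weights equal to 1
    return sum(
        w * (math.log(t_emp) - math.log(theory(k, zeta, xi, lam, x_low))) ** 2
        for k, t_emp, w in zip(k_values, t_empirical, weights)
    )

# Synthetic check: return periods generated from known parameters give zero error
true_params = (2.4, 0.3, 50.0, 4.0)              # (zeta, xi, lambda, x_L), hypothetical
k_values = [10.0, 50.0, 200.0, 800.0]            # stand-ins for estimated K-moments
t_emp = [pbf_return_period(k, *true_params) for k in k_values]
print(fitting_error(k_values, t_emp, true_params))
print(fitting_error(k_values, t_emp, (2.4, 0.2, 50.0, 4.0)))
```

A general-purpose minimizer can then be applied to `fitting_error` over the four parameters, which plays the role of the spreadsheet solver mentioned above.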

A default value of the weights is $w_T = \overline{w}_T = 1$. However, if we wish to focus the fitting
on a particular part of the distribution, we can use different weights; for example, if we
wish to neglect all values with return periods smaller than time $d$ (e.g. $d = 1$ year) we may
use:

$$w_T = \begin{cases} 0, & \hat{T}(K'_{p_i}) < d \\ 1, & \hat{T}(K'_{p_i}) \ge d \end{cases} \tag{9.7}$$

If we wish to give more emphasis on fitting to high return periods, we may use

$$w_T = \left(\hat{T}(K'_{p_i})/d\right)^{b} \tag{9.8}$$

where $b$ is a positive number, e.g. $b = 0.5$. Likewise for the weights $\overline{w}_T$.

Obviously, instead of the approximations of the Λ-coefficients by equations (6.94) and
(6.96), the more accurate nonlinear approximations of section 6.14 could be used. However, as
we will see in the applications that follow, the difference in the fitting is negligible.
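
The two weighting schemes can be written as small functions. This is a sketch: the function names are hypothetical, and assigning the weight 1 at the boundary value of exactly d is an assumption.

```python
def threshold_weight(t_emp, d=1.0):
    """Weight in the spirit of (9.7): ignore return periods shorter than d."""
    return 1.0 if t_emp >= d else 0.0   # treatment of t_emp == d is an assumption

def power_weight(t_emp, d=1.0, b=0.5):
    """Weight in the spirit of (9.8): emphasis grows with the return period."""
    return (t_emp / d) ** b

t_years = [0.01, 0.5, 2.0, 10.0, 100.0]   # hypothetical empirical return periods
print([threshold_weight(t) for t in t_years])
print([round(power_weight(t), 3) for t in t_years])
```

Either list of weights can be passed to the error sum in place of the default weights of 1.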


A first application of the methodology is given in Digression 9.B while additional 
applications will be seen in subsequent sections. 


Digression 9.B: Fitting of a single PBF distribution on the entire domain 


We have seen in Digression 6.E that the PBF distribution with a single parameter set provided a 
good fit for a 206-year long record of daily rainfall in Bologna. Now we investigate whether this 
is feasible for streamflow data. We use as an example the streamflow data of the French Broad 
River at Asheville, NC, USA (USGS station 03451500, 35.609°N, 82.578°W, drainage area 2 447.5
km²). The data cover the period from October 1895 to March 2020 (more than 124 calendar years,
uninterrupted; daily values $n$ = 45 468).

Figure 9.2 Empirical σ-climacogram and $K_2$-climacogram of daily data of the French Broad River at
Asheville, NC, USA. The power-laws, fitted by regression and plotted as dashed lines, have slopes -0.49 and
-0.48, respectively. The Hurst parameter, estimated from the annual series, is $H = 0.58$.


The climacogram of the time series is shown in Figure 9.2. Namely we give the σ-climacogram,
which is the (doubly logarithmic) plot of the standard deviation, $\sigma$, vs. the time scale, $k$. In addition,
we show in the same graph the $K_2$-climacogram, which is the plot of the central K-moment of
orders (2,1), $K_2$, vs. the time scale, $k$. Both quantities have units of m³/s and their plots become
virtually parallel straight lines for scales $k > 1$ year; their slope is $H - 1$ where $H$ is the Hurst
parameter. Here the Hurst parameter is estimated by the algorithm by Koutsoyiannis (2003) at
$H = 0.58$. The relatively small value of $H$ suggests that the time dependence effect can be
neglected in estimating K-moments, which do not need any adaptation.
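
The idea of reading the Hurst parameter from the slope of the σ-climacogram can be illustrated on synthetic data. The sketch below is not the algorithm of Koutsoyiannis (2003): it is a plain log-log regression that ignores estimation bias, applied to white noise (for which H = 0.5), and the function names, sample size and scales are hypothetical.

```python
import math
import random

def climacogram(series, scales):
    """Standard deviation of the time-averaged series at each scale k."""
    result = []
    for k in scales:
        n_blocks = len(series) // k
        means = [sum(series[j * k:(j + 1) * k]) / k for j in range(n_blocks)]
        centre = sum(means) / n_blocks
        variance = sum((v - centre) ** 2 for v in means) / (n_blocks - 1)
        result.append(math.sqrt(variance))
    return result

def hurst_from_slope(scales, sigmas):
    """H = 1 + slope of ln(sigma) against ln(k), by least squares."""
    xs = [math.log(k) for k in scales]
    ys = [math.log(s) for s in sigmas]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return 1.0 + num / den

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2 ** 14)]   # synthetic white noise
scales = [2 ** j for j in range(8)]                      # k = 1, 2, ..., 128
hurst = hurst_from_slope(scales, climacogram(noise, scales))
print(round(hurst, 2))
```

For strongly persistent series such a regression underestimates H because the empirical climacogram is biased, which is why a bias-aware estimator is preferable in practice.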

The K-moments are estimated for $m + 1$ moment orders with $m = 80$, so that $p_i := n^{i/m}$ ($p_0 = 1$,
$p_{80} = n$ = 45 468) using the unbiased estimators. Both noncentral and tail moments are
estimated. In a following step, for each $K'_{p_i}$ or $\overline{K}'_{p_i}$, the return period $T(K'_{p_i})$ or $\overline{T}(\overline{K}'_{p_i})$ is
estimated through the Λ-coefficients using both the linear and nonlinear approximations.

Figure 9.3 Comparison of empirical and theoretical PBF distribution fitted on daily data of the French
Broad River at Asheville, NC, USA. The fitting was done based on noncentral and tail K-moments using the
linear approximation (denoted in the figure “K-moments 1”). For comparison the nonlinear approximation
(“K-moments 2”) is also shown but it is indistinguishable from the former case. Furthermore, empirical
return periods assigned from order statistics are also shown. The distribution function is depicted in terms
of return periods $T$ or $\overline{T}$ (upper and lower row, respectively), either on their entire range (left column) or
focusing on those greater than 1 year (right column). The parameters are: $\xi = 0.307$, $\zeta = 2.40$, $\lambda = 51.4$ m³/s,
$x_L = 4.38$ m³/s.


Figure 9.3 shows the empirical estimates of these K-moments against their return periods.
Both $K'_{p_i}$ and $\overline{K}'_{p_i}$ have been used for each of the plots; note that given the return period with
reference to minima $\overline{T}(\overline{K}'_{p_i})$ the return period with reference to maxima is:

$$\frac{T(\overline{K}'_{p_i})}{D} = \frac{1}{1 - D/\overline{T}(\overline{K}'_{p_i})} \tag{9.9}$$

In this manner, we use $2m + 1 = 161$ points in each of the plots. Plots are given in terms of both
$T$ and $\overline{T}$ in Figure 9.3.
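
The conversion between the two kinds of return period follows from the fact that the probabilities of non-exceedance and exceedance sum to 1, i.e. D/T̄ + D/T = 1. A minimal sketch, assuming D denotes the time step of the series expressed in the same unit as the return periods (the function name and the daily step in years are illustrative):

```python
def max_return_period(t_min, D=1.0):
    """Return period for maxima, T, from that for minima, T_bar: T/D = 1/(1 - D/T_bar)."""
    if t_min <= D:
        raise ValueError("the return period for minima must exceed the time step D")
    return D / (1.0 - D / t_min)

D = 1.0 / 365.25            # daily time step expressed in years (an assumption)
for t_min in (0.01, 0.1, 1.0, 10.0):
    print(t_min, max_return_period(t_min, D))
```

Values that are rare as minima (large T̄) map to return periods for maxima only slightly above D, and vice versa, which is why each K-moment supplies a point to both kinds of plot.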

In all plots, the series of points, calculated by the linear and the nonlinear approximations of
the return periods of K-moments, coincide almost completely and are virtually indistinguishable.
In addition, the empirical return periods based on order statistics are also plotted and agree very
well with the K-moments plots, except for some scatter at the largest return periods.

The fitting of the PBF distribution was made with the algorithm described in section 9.2
minimizing the sum $E + \overline{E}$ with default weights (= 1). The parameters of the distribution are
shown in the caption of Figure 9.3. The theoretical PBF distribution fitted is also indistinguishable
from the curves of the K-moments. Overall, the plot shows a satisfactory global fit of a single PBF
distribution on return periods spanning 5 orders of magnitude.

However, if we focus on the high return periods, as seen on the right panels of Figure 9.3, we 
will conclude that this global fit is not perfect. We may thus decide to perform two different 
fittings, with different parameter sets, separately for the low and the high flows. To do this it 
suffices to use the weights of equation (9.7). The resulting plots are shown in Figure 9.4, with the 
parameter values given in the figure caption. Now the two fittings are perfect.

Figure 9.4 Comparison of empirical and theoretical PBF distribution fitted on daily data of the French
Broad River at Asheville, NC, USA, as in Figure 9.3 but with a fitting focusing on $T > 1$ year (left) or $\overline{T} > 1$
year (right). The parameters are, for the left panel: $\xi = 0.277$, $\zeta = 5.06$, $\lambda = 81.7$ m³/s, $x_L = 4.87$ m³/s, and
for the right panel: $\xi = 0$, $\zeta = 2.16$, $\lambda = 77.3$ m³/s, $x_L = 4.40$ m³/s.


9.3 Fitting a distribution on the distribution body


While the fitting methodology described in section 9.2 is extreme-oriented and thus good 
for the distribution tails, it is also quite general and can easily provide a fit to the body of 
the distribution. The easiest way to do that is by using only low-order moments. As the 
PBF distribution, in the form used in this chapter, contains four parameters, a simple 
technique is to fit so that the first four theoretical K-moments match the corresponding 
empirical ones. This can be done applying again the framework of section 9.2 but with 
weights:

$$w_T\big(\hat{T}(K'_{p_i})\big) = \begin{cases} 1, & p_i = 1, 2, 3, 4 \\ 0, & \text{otherwise} \end{cases} \tag{9.10}$$

The advantage of using this framework, in addition to its generality, is that we bypass the
theoretical calculation of the K-moments per se, in essence replacing it with the
calculation of Λ-coefficients, which are easily and accurately approximated. An illustration
is given in Digression 9.C.


Digression 9.C: An example of fitting a PBF distribution on its body 


Here we use another example, the streamflow data of the Susquehanna River at Danville, PA, USA
(USGS station 01540500, 40.958°N, 76.619°W, drainage area 29 059.7 km²). The data cover the
period October 1905 to June 2020 (more than 114 calendar years, uninterrupted; daily values
$n$ = 42 081). The σ-climacogram and the $K_2$-climacogram of daily data are shown in Figure 9.5.
The Hurst parameter estimate is $H = 0.61$ and its effect on return periods can be neglected.

Figure 9.5 Empirical σ-climacogram and $K_2$-climacogram of daily data of the Susquehanna River at
Danville, PA, USA. The power-laws, fitted by regression and plotted as dashed lines, have slopes -0.41 and
-0.39, respectively. The Hurst parameter estimate is $H = 0.61$.


Initially, we try a global fitting following the procedure of Digression 9.B. This is shown in 
Figure 9.6, where it can be seen that this fitting is unsatisfactory for both distribution tails, as well 
as for the body of the distribution. Good fittings on the tails are shown in Figure 9.7, performed in 
the same manner as in Digression 9.B. 

For a fitting on the body of the distribution we can follow the procedure of section 9.3. This is 
illustrated in Figure 9.8. The resulting theoretical distribution is good for the body but totally 
inappropriate for either of the tails. 

Thus, in this case we have three different fittings, with different parameter sets, each of which
is good for one part of the distribution: the upper tail, the lower tail and the body. Assuming that
we perform simulation for a particular technical problem, which of the three should we use? For 
problems related to floods, it is reasonable to use the fitting on the upper tail. But for problems 
related to low flows the answer is not that direct. At first glance it appears that the fitting on the 
lower tail is pertinent. This is actually the case when estimating environmental flows in pristine 
basins. However, when studying regulation structures such as reservoirs, simulation of low flows 
should be performed as, in this case, it is not the quantity of natural low flow that matters, but the 
succession of flows. Therefore, it may be more appropriate to make the fitting on the body of the 
distribution.

Figure 9.6 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Susquehanna
River at Danville, PA, USA. The distribution function is depicted in terms of return periods $T$ or $\overline{T}$ (upper
and lower row, respectively), either on their entire range (left column) or focusing on those greater than
1 year (right column). The parameters are: $\xi = 0.202$, $\zeta = 1.43$, $\lambda = 495.6$ m³/s, $x_L = 15.56$ m³/s. For
further explanations see caption of Figure 9.3.

Figure 9.7 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Susquehanna
River at Danville, PA, USA, as in Figure 9.6 but with a fitting focusing on $T > 1$ year (left) or $\overline{T} > 1$ year
(right). The parameters are, for the left panel: $\xi = 0.0$, $\zeta = 0.721$, $\lambda = 339.1$ m³/s, $x_L = 15.64$ m³/s, and for
the right panel: $\xi = 0.5$, $\zeta = 2.06$, $\lambda = 199.4$ m³/s, $x_L = 14.60$ m³/s.


[Figure 9.8: four panels with horizontal axis “Return period, T (years)”. Legend: Empirical, from K-moments 1; Empirical, from K-moments 2; Empirical, from order statistics; Theoretical.]

Figure 9.8 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Susquehanna
River at Danville, PA, USA, as in Figure 9.6 but with a fitting focusing on the body of the distribution
performed by fitting the first four noncentral K-moments. The distribution function is depicted in terms of
return periods $T$ or $\overline{T}$ (upper and lower row, respectively), either on their entire range (left column) or
focusing on those greater than 1 year (right column). The parameters are: $\xi = 0.247$, $\zeta = 1.04$, $\lambda = 329.8$ m³/s,
$x_L = 15.64$ m³/s. For further explanations see caption of Figure 9.3.


Obviously, one may think of using a more complex model, such as the sum of two PBF 
distributions. This is feasible and only requires a slight modification of the methodology, but it is 
out of our scope as here model parsimony is a strong desideratum. 


9.4 Distribution fitting in the presence of persistence 


Persistence is quite frequent in streamflow. However, in the above illustrations, its 
intensity was moderate as the Hurst parameter was around 0.60. If it becomes large, 
around 0.80 or greater, then it affects the estimates of K-moments and should be taken 
into account. The fitting methodology described in section 9.2 can deal with such cases 
readily. An illustration is given in Digression 9.D.


Digression 9.D: An example of distribution fitting in the presence of persistence


We use as an example of a streamflow record with persistence the Red River of the North at Grand
Forks, ND, USA (USGS station 05082500, 47.927°N, 97.029°W, drainage area 77 958.6 km²). The
data cover the period April 1882 to November 2019 (more than 136 full calendar years,
uninterrupted; daily values $n$ = 50 269). The σ-climacogram and the $K_2$-climacogram of daily
data are shown in Figure 9.9. The Hurst parameter estimate is $H = 0.91$ and should have a marked
effect, i.e., bias, on return periods. We note that, with this value, the slope of the theoretical
σ-climacogram is -0.09 ($= H - 1$). However, the empirical σ-climacogram in Figure 9.9 has a slope -0.18. This
is not an inconsistency or an error; it is a result of the fact that the empirical climacogram is
affected by bias, which was taken into account in the estimation of $H = 0.91$.

Figure 9.9 Empirical σ-climacogram and $K_2$-climacogram of daily data of the Red River of the North at
Grand Forks, ND, USA. The power-laws, fitted by regression and plotted as dotted lines, have slopes -0.18
and -0.17, respectively. The Hurst parameter, estimated from the annual time series, is $H = 0.91$.


The bias correction factor is $\Theta \approx 2H(1-H)/(n-1) - 1/\big(2(n-1)^{2-2H}\big) = -0.071$. Using 
$p' = 2\Theta + (1-2\Theta)\,p^{(1+\Theta)^2}$ we transform each p to p′. For p = 1, p′ = 1 and for p = n = 
50 269, p′ = 12 977 (a big reduction, almost to 1/4). The return periods are then calculated as 
$T(K_{p'}) = \Lambda_\infty p' + \Lambda_1 - \Lambda_\infty$. The remaining steps are the same as in Digression 9.B. The global 
fitting is shown in Figure 9.10. The empirical return periods based on K-moments, as plotted in 
the figure, are adapted for the bias due to persistence while those based on order statistics are 
not. While adaptation of the latter is possible, as described in section 8.5, we deliberately avoided 
it to illustrate the difference. Fittings on the tails are shown in Figure 9.11, performed in the same 
manner as in Digression 9.B. 
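
The arithmetic of the bias adaptation can be checked with a short script. What follows is only a minimal sketch, assuming that the bias correction factor and the transformation of the order p have the forms Θ ≈ 2H(1 − H)/(n − 1) − 1/(2(n − 1)^(2 − 2H)) and p′ = 2Θ + (1 − 2Θ) p^((1 + Θ)²) used in this Digression. The function names are hypothetical and serve illustration only.

```python
def bias_correction_factor(n, hurst):
    """Approximate bias correction factor (Theta) for sample size n and Hurst parameter H.

    Assumed form: 2H(1-H)/(n-1) - 1/(2 (n-1)^(2-2H)).
    """
    return 2 * hurst * (1 - hurst) / (n - 1) - 1 / (2 * (n - 1) ** (2 - 2 * hurst))


def adapted_order(p, theta):
    """Transform the K-moment order p to the bias-adapted order p'.

    Assumed form: p' = 2 Theta + (1 - 2 Theta) p^((1 + Theta)^2).
    """
    return 2 * theta + (1 - 2 * theta) * p ** ((1 + theta) ** 2)


if __name__ == "__main__":
    n, hurst = 50_269, 0.91  # Red River of the North, daily values
    theta = bias_correction_factor(n, hurst)
    print(f"Theta           = {theta:.3f}")
    print(f"p' for p = 1    = {adapted_order(1, theta):.0f}")
    print(f"p' for p = n    = {adapted_order(n, theta):.0f}")
```

By construction p′ = 1 for p = 1 whatever the value of Θ, so the adaptation leaves the mean untouched and acts increasingly on the higher orders, which are those that determine the large return periods.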

Figure 9.10 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Red River 
of the North at Grand Forks, ND, USA. The distribution function is depicted in terms of the return periods 
of high or low extremes (upper and lower row, respectively), either on their entire range (left column) or 
focusing on those greater than 1 year (right column). The empirical return periods based on K-moments are 
adapted for the bias due to persistence while those based on order statistics are not. The parameters are: 
ξ = 0.194, ζ = 0.906, λ = 106.8 m³/s, x_L = 0.049 m³/s. For further explanations see caption of Figure 9.3. 

Figure 9.11 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Red River 
of the North at Grand Forks, ND, USA, as in Figure 9.3 but with a fitting focusing on return periods greater 
than 1 year of high extremes (left) or low extremes (right). The parameters are, for the left panel: 
ξ = 0, ζ = 0.720, λ = 148.5 m³/s, x_L = 0.033 m³/s, and for the right panel: ξ = 0.5, ζ = 1.008, 
λ = 59.7 m³/s, x_L = 0.047 m³/s. 


9.5 Distribution fitting in the presence of extraordinary extremes 


Sometimes the streamflow time series contain an extraordinary extreme, which may 
differ significantly from other extremes. This is usually called an outlier and needs 
particular attention (Grubbs, 1969). An outlier may indicate measurement error but 
usually it just reflects the high variability of the streamflow, the so-called Noah effect 
(Mandelbrot and Wallis, 1968). Many view an outlier as a cause of serious problems in 
statistical analyses and exclude it from the data set to make the analysis easier and the 
model fitting more elegant. 

However, this is not a proper tactic. When studying extremes, excluding the most 
extreme observation is rather irrational. The suggestion here is to check the measurement 
conditions to see if it reflects a measurement error. Once this possibility is excluded, the 
outlier should be kept in the data set and the fitting methodology described in section 9.2 
should be kept unchanged. The K-moments framework and in particular the use of K- 
moments for order q = 1 is the most robust in the presence of outliers. This does not 
mean that the model would not be affected by outliers—this would be unreasonable. An 
illustration is given in Digression 9.E, where we also discuss the difference when 
accounting for or excluding the highest observation. 


Digression 9.E: An example of distribution fitting in the presence of 
extraordinary extremes 


In this example we use the streamflow record of the Tenmile Creek near Rimini, MT, USA (USGS 
station 06062500, 46.524°N, 112.257°W). The drainage area is small (80 km²). The data cover the 
period October 1914 to February 2004 with a gap from October 1994 to April 1997 (85 full 
calendar years, daily values n = 31 706). The daily flow time series is plotted in Figure 9.12, 
where it can be seen that on 1981-05-22 an extraordinary extreme flood occurred with a 
discharge of 53.24 m³/s. This is an order of magnitude higher than the usual flood discharges. 

Figure 9.12 Plot of the discharge time series at the Tenmile Creek near Rimini, MT, USA. 


The σ-climacogram and the K₂-climacogram of data are shown in Figure 9.13. The 
climacograms show an erratic behaviour at scales between 1 and 2 years which could be the result 
of the annual periodicity (prominent also in Figure 9.12). Therefore, the climacograms derived from 
the annual time series, whose plot is smooth, are also shown. The Hurst parameter, estimated from 
the annual series, is H = 0.79 and should only have a slight effect, i.e., bias, on return periods. 

Figure 9.13 Empirical σ-climacogram and K₂-climacogram of daily data of the Tenmile Creek near Rimini, 
MT, USA. The lines have been constructed from the daily series and the points from the annual series. The 
power laws, fitted by regression on the points of the annual series and plotted as dotted lines, have slopes 
−0.26 and −0.27, respectively. The Hurst parameter, estimated from the annual series, is H = 0.79. 


Contrary to the earlier investigated time series, the Tenmile Creek series contains zero values 
(dry condition), namely, 48 zero values out of 31 706 total values. This suggests a small 
probability dry, P₀ = 0.00151 at the daily scale. Yet this small value makes the analysis of low 
extremes unnecessary. Indeed, the return period of the zero value, in the sense of low extremes, 
will be T₀/D = 1/P₀ = 31 706/48 = 660.5, or T₀ = 660.5 d = 1.81 years. Therefore, for return periods 
of low extremes greater than T₀ = 1.81 years all distribution quantiles will be zero. Hence, we only 
study the high extremes here. 
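
The numbers in this paragraph follow from simple arithmetic, sketched below. The sketch assumes a time step D = 1 d and a mean year length of 365.25 d for the conversion to years; the function names are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumed mean length of a calendar year, in days


def probability_dry(n_zero, n_total):
    """Empirical probability dry: fraction of zero values in the record."""
    return n_zero / n_total


def return_period_of_zero(p_dry, time_step_days=1.0):
    """Return period (in days) of the zero value for low extremes, T0 = D / P0."""
    return time_step_days / p_dry


if __name__ == "__main__":
    p0 = probability_dry(48, 31_706)  # Tenmile Creek, daily values
    t0_days = return_period_of_zero(p0)
    t0_years = t0_days / DAYS_PER_YEAR
    print(f"P0 = {p0:.5f}")
    print(f"T0 = {t0_days:.1f} d = {t0_years:.2f} years")
```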

Even though the bias is expected to be small in this case (H = 0.79 < 0.80) we take it into 
account for illustration purposes. The upper and lower panels of Figure 9.14 depict fitting without 
and with bias adaptation, respectively. In the latter case, the bias correction factor is 
$\Theta \approx 2H(1-H)/(n-1) - 1/\big(2(n-1)^{2-2H}\big) = -0.0064$. As we have done in Digression 9.D, using 
$p' = 2\Theta + (1-2\Theta)\,p^{(1+\Theta)^2}$ we transform each p to p′. For p = 1, p′ = 1 and for p = n = 31 706, 
p′ = 28 122 (an 11% reduction). The return periods are then calculated as 
$T(K_{p'}) = \Lambda_\infty p' + \Lambda_1 - \Lambda_\infty$. 
The fitting in Figure 9.14 was made with weights w = 1. It can be seen that the bias adaptation 
has only slight effects on the parameters, shown in the caption of the figure. 

Fittings on the tails, with w = 0 for T < 0, are shown in Figure 9.15. Its two panels compare 
the two fitting cases (a) considering all data, including the highest value (left panel), and (b) 
considering all data but the highest value (right panel). The differences between the two cases are 
dramatic, both in the distribution parameters (shown in the figure caption), particularly the tail 
index ξ, and in the resulting return periods. In case (a) the return period of the highest value is 213 
and 383 years if estimated empirically and theoretically, respectively. The difference is 
reasonable for an outlier. But in case (b) the theoretical return period becomes 8600 years, 22 
times higher. An empirical return period cannot be assigned in this case as the value has been 
excluded from the analysis. One can further observe that the fitting in case (b) looks better as the 
agreement between model and (censored) observations is perfect. In case (a) there is a difference 
between theoretical and empirical estimates and also between empirical estimates derived by K- 
moments and order statistics. However, given the huge difference of the fittings in the two cases, 
and adopting an engineering point of view, we should clearly prefer the fitting of case (a) and 
abandon any temptation to dismiss the extraordinary extreme for the sake of modelling elegance. 

This case also emphasizes the better performance of K-moments against order statistics. 

Figure 9.14 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Tenmile 
Creek near Rimini, MT, USA, without (upper) and with (lower) bias adaptation (the points based on order 
statistics are not adapted in either case). The distribution function is depicted in terms of return periods T, 
either on their entire range (left column) or focusing on those greater than 1 year (right column). The 
parameters are, for the upper panels: ξ = 0.305, ζ = 0.896, λ = 0.37 m³/s, x_L = 0, and for the lower panels: 
ξ = 0.308, ζ = 0.887, λ = 0.36 m³/s, x_L = 0. For further explanations see caption of Figure 9.3. 

Figure 9.15 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Tenmile 
Creek near Rimini, MT, USA, as in Figure 9.14 but with a fitting focusing on T > 1 year by considering the 
highest value (left) or not (right). The parameters are, for the left panel: ξ = 0.325, ζ = 5.08, λ = 
1.25 m³/s, x_L = 0, and for the right panel: ξ = 0.205, ζ = 20, λ = 2.67 m³/s, x_L = 0. 


9.6 Some general remarks 


The above framework, implemented in several case studies, suggests that the PBF 
distribution on one hand and the K-moments approach on the other hand, provide 
acceptable means to deal with the distribution of streamflow and particularly its tails. 
However, in most of the cases a single parameter set is not enough to cover the entire 
span of discharge variation, which can reach 6 orders of magnitude. We generally need 
different parameter sets for the high and low extremes, and perhaps a third one for the 
body of the distribution. Notably, the methodology is exactly the same in all three cases. 
Only the objective function (weighted fitting error) changes. 

The case studies examined in the Digressions cover different conditions with varying 
characteristics. An additional case study is given in Digression 9.F for a large European 
basin which is not pristine and has a high baseflow, much higher than in all other basins examined. 
As already stated in section 9.1, we cannot draw generalized conclusions from a few 
catchments. However, some indications can be seen in Table 9.1, which gathers the fitted 
parameter sets of all case studies. The tail index ξ can take large values, but not higher 
than 1/3, which means that the classical coefficient of skewness is generally finite, even 
though that of the kurtosis could be infinite. For large basins the tail index tends to 
become zero, which agrees with the indicative theoretical analysis of Digression 9.A. The 
lower-tail index ζ varies considerably; in most cases it is higher than 1 (bell-shaped 
density function) but it can also be smaller than 1 (decreasing density function). The 
lower bound x_L is generally > 0 but can be zero for small basins, where a probability dry > 0 
can emerge. 
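
The remark on skewness and kurtosis rests on a standard property of distributions with a power-type upper tail: for a tail index ξ > 0 the exceedance probability decays as x^(−1/ξ), so that the classical moment of order q is finite only if q < 1/ξ. The sketch below applies this rule to some of the tail indices of Table 9.1. The function name and the row labels are hypothetical, and the rule is the generic one for power-type tails rather than a result derived in this chapter.

```python
def moment_is_finite(xi, order):
    """True if the classical moment of the given order is finite for upper-tail index xi.

    For xi <= 0 the tail is not of power type and all moments are taken as finite.
    """
    return xi <= 0 or order < 1 / xi


# Tail indices taken from Table 9.1 (labels are for illustration only)
fitted_tail_indices = {
    "Tenmile Creek, T > 1 year": 0.325,
    "French Broad River, global fit": 0.307,
    "Susquehanna River, global fit": 0.202,
    "Po River, global fit": 0.088,
}

if __name__ == "__main__":
    for label, xi in fitted_tail_indices.items():
        skewness_ok = moment_is_finite(xi, 3)  # skewness needs the third moment
        kurtosis_ok = moment_is_finite(xi, 4)  # kurtosis needs the fourth moment
        print(f"{label}: xi = {xi}, skewness finite: {skewness_ok}, "
              f"kurtosis finite: {kurtosis_ok}")
```

Under this rule any ξ below 1/3 keeps the third moment finite, while any ξ above 1/4 makes the fourth moment infinite.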


Table 9.1 Fitted parameter sets of all streamflow records studied. 

River or stream                          Area (km²)  H (-)  Optimization    ξ (-)   ζ (-)   λ (m³/s)  x_L (m³/s)
Tenmile Creek near Rimini                80.0        0.79   E+E (w = 1)     0.308   0.887   0.36      0
                                                            E, T > 1 year   0.325   5.08    1.25      0
French Broad River at Asheville          2 447.5     0.58   E+E (w = 1)     0.307   2.40    51.4      4.38
                                                            E, T > 1 year   0.277   5.06    81.7      4.87
                                                            E, T > 1 year   0       2.16    77.3      4.40
Susquehanna River at Danville            29 059.7    0.61   E+E (w = 1)     0.202   1.43    495.6     15.56
                                                            E, T > 1 year   0       0.721   339.1     15.64
                                                            E, T > 1 year   0.5     2.06    199.4     14.60
                                                            Four moments    0.247   1.04    329.8     15.64
Po River at Pontelagoscuro               70 091      0.61   E+E (w = 1)     0.088   2.03    1879.8    158.9
                                                            E, T > 1 year   0       1.71    2419.1    0
                                                            E, T > 1 year   0.088   2.39    1582.4    148.1
Red River of the North at Grand Forks    77 958.6    0.91   E+E (w = 1)     0.194   0.906   106.9     0.049
                                                            E, T > 1 year   0       0.720   148.5     0.033
                                                            E, T > 1 year   0.5     1.008   59.7      0.047


Digression 9.F: An example for a large European river 


In a final example we use the streamflow record of the Po River at Pontelagoscuro, Italy (near the 
city of Ferrara; 44.888°N, 11.607°E). The drainage area is large, 70 091 km². The Po River has 141 
main tributaries and the related river network has a total length of about 6 750 km and 31 000 
km for natural and artificial channels, respectively. About 450 lakes are located in the Po River 
basin. The water level of the larger south-alpine lakes of glacial origin is regulated according to 
given management policies, which provides a regulation volume of approximately 1.3 km³ 
(Montanari, 2012). Because of this regulation, the streamflow in this case is not natural. Yet 
one may assume that the highest floods would not differ substantially from natural ones as the 
regulation margin is diminished. The data cover the period January 1920 to December 2009 (90 
full calendar years, uninterrupted; daily values n = 32 850). 

The σ-climacogram and the K₂-climacogram of data are shown in Figure 9.16. The Hurst 
parameter, estimated from the annual series, is H = 0.61 (a moderate value) and its effect (bias) 
on return period estimation is negligible. 

Figure 9.16 Empirical σ-climacogram and K₂-climacogram of daily data of the Po River at Pontelagoscuro, 
Italy. The power laws, fitted by regression and plotted as dotted lines, have slopes −0.45 and −0.42, 
respectively. The Hurst parameter estimate is H = 0.61. 


The global fitting, performed by the procedure of Digression 9.B, is shown in Figure 9.17, where 
it can be seen that this fitting is not quite satisfactory for either distribution tail or for the 
body of the distribution. Good fittings on the tails are shown in Figure 9.18, again performed in 
the same manner as in Digression 9.B. 

Figure 9.17 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Po River 
at Pontelagoscuro, Italy. The distribution function is depicted in terms of the return periods of high or low 
extremes (upper and lower row, respectively), either on the entire range, with parameters ξ = 0.088, 
ζ = 2.03, λ = 1879.8 m³/s, x_L = 158.9 m³/s (left column), or focusing on T > 1 year, with parameters 
ξ = 0, ζ = 1.71, λ = 2419.1 m³/s, x_L = 0 (right column). For further explanations see caption of Figure 9.3. 

Figure 9.18 Comparison of empirical and theoretical PBF distribution fitted on daily data of the Po River 
at Pontelagoscuro, Italy, as in Figure 9.17 but with a fitting focusing on return periods greater than 1 year 
of high extremes (left) or low extremes (right). The parameters are, for the left panel: ξ = 0, ζ = 1.71, 
λ = 2419.1 m³/s, x_L = 0, and for the right panel: ξ = 0.088, ζ = 2.39, λ = 1582.4 m³/s, x_L = 148.1 m³/s. 


Chapter 10. Extremes of atmospheric processes 


We have already studied the most demanding processes: streamflow, which varies among 
several orders of magnitude, and rainfall, whose study requires description at many time 
scales simultaneously. The study of extremes of atmospheric processes is easier and no 
additional methodology is required. In addition, databases with observational 
information of atmospheric processes abound. This information includes ground data 
(measurements at meteorological stations), reanalyses (gridded data resulting from 
assimilation of meteorological measurements into weather models) and satellite data 
(resulting from images and observations by remote sensing instruments and 
incorporation of ground measurements). Here we will deal with ground data only, which 
are best suited for the study of extremes. The other categories are useful for studies in the 
global or continental scales (see examples in Koutsoyiannis, 2020b). 

Atmospheric data such as wind, temperature, pressure, radiation, etc., can easily be 
retrieved from several publicly available databases. Among them, KNMI’s Climexp 
system,* in connection with the European Climate Assessment & Dataset project (ECAD; 
Klein Tank et al., 2002),† is the most convenient. Other data such as relative humidity, 
vapour pressure and dew point can be accessed through national databases; good 
examples are the Climate Data Center (CDC) of the German Meteorological Service 
(Deutscher Wetterdienst)‡ and the USA NOAA’s National Centers for Environmental 
Information (NCEI).§ 

In the following sections of this chapter, we provide representative examples for the 
wind speed, temperature and dew point, where the latter is similar to (and measured in 
the same units as) temperature, but has higher hydrological importance as it determines 
the quantity of water vapour in the atmosphere. The behaviour of wind speed is not very 
different from that of precipitation, even though its variability is smaller. Temperature 
and dew point have different behaviour as their distributions are well-formed bell-shaped 
and not very distant from the normal. The distribution of air pressure is also bell-shaped. 

In any of these processes the persistence can reach high levels (e.g. H = 0.9) and 
therefore, whenever this happens, the bias due to time dependence should be taken into 
account. Neglect (and often ignorance) of persistence is common among scientists and 
practitioners, less in the engineering community and more in the climatological 
community as well as in the insurance industry, which provides services related to 
extreme events. Not only does this neglect affect distribution fitting, but it may also 
have negative consequences in the design, operation and management of structures. A 
relevant example is in wind farms, which are typically designed without studying 
persistence. In the operation phase, the operators are often surprised to see the 
installation deliver less power than the average for long periods. However, this is a normal 
behaviour, related to persistence; it is exactly the clustering of low wind periods. 


* https://climexp.knmi.nl/. 

† This also provides access to data (more than 20 000 meteorological stations): https://www.ecad.eu/; see 
also: https://data.europa.eu/euodp/en/data/dataset/jrc-tmy-tmy-download-service. 

‡ https://cdc.dwd.de/portal/. 

§ https://www.ncei.noaa.gov/access/search/dataset-search; http://www.ncdc.noaa.gov/isd; 
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. We note though that this database is not as easy to use as the other ones. 

Most of the processes are characterized by double cyclostationarity, daily and annual. 
This should be taken into account when a full stochastic model of the process of interest 
is constructed (e.g. Dimitriadis and Koutsoyiannis, 2015b; Deligiannis et al., 2016). 
However, here we only deal with the marginal distributions with emphasis on extremes 
and we will omit the study of cyclostationarity. 


10.1 Wind 


As already mentioned, the behaviour of wind speed is not very different from that of 
precipitation. However, the variability of the former is smaller. As seen in Chapter 8, 
intermittence is a prominent behaviour in rainfall, which has been modelled through the 
probability wet (P₁) or dry (P₀). On the other hand, as seen in Chapter 9, in streamflow 
intermittence can appear either in a manner similar to rainfall (in small ephemeral 
streams), or as a two-state wet mode, one in which the river is fed merely by groundwater 
(baseflow) and one dominated by flood. In the case of wind, in principle there is no 
intermittence in the sense of a state where air ceases to move. However, the very small 
wind speeds are often registered as zero, because of imperfection of the measuring 
equipment. This case emerges particularly at hourly or finer time scales. 

Long time series of wind speed suggest long-term persistence and, most often, heavy- 
tailed distributions (Tsekouras and Koutsoyiannis, 2014; Koutsoyiannis et al., 2018), even 
though the light-tailed Weibull distribution has been the dominant model in the literature. 

The stochastic analysis of wind speed has recently become very important because of 
its relevance to wind energy generation. In renewable energy design and management, 
it is the body of the distribution, rather than its tails, which matters. Indeed, values at 
either of the tails are not relevant to energy production because the production ceases 
when the wind speed is too low or too high. The lower tail per se is not quite important, 
apart from the probability that the wind speed is below the threshold at which the 
production ceases. However, the upper tail is very important for the safe design of the 
wind turbine and the entire construction, as the extreme winds determine the turbine 
loads. The calculations of the load (see e.g. Dai et al., 2011) are beyond our scope. 

Below we provide representative examples for studies of daily (Digression 10.A) and 
hourly (Digression 10.B) wind speed data. 


Digression 10.A: An example of fitting a PBF distribution on mean daily 
wind speed 


To illustrate the behaviour of wind on the daily scale we use one of the longest data sets, that of the 
Hoofdplaat station in the Netherlands (51.38°N, 3.67°E, 0.0 m). The data cover the period April 1908 
to December 2019 (111 calendar years, with an interruption of 10 months starting in December 
1944; daily values n = 40 720, all but four of which are > 0). Characteristic plots of the time series 
are given in Figure 10.1. 

Figure 10.1 Monthly plot of the daily wind speed time series at the Hoofdplaat station, Netherlands. For 
each month, the average, the minimum and the maximum daily values are plotted. 


The σ-climacogram and the K₂-climacogram of data are shown in Figure 10.2. The Hurst 
parameter, estimated from the annual series, is H = 0.9 and its effect (bias) on return period 
estimation is substantial. 

Figure 10.2 Empirical σ-climacogram and K₂-climacogram from the daily wind speed time series at the 
Hoofdplaat station, Netherlands. The lines have been constructed from the daily series and the points from 
the annual series. The power laws, fitted by regression on the points of the annual series and plotted as 
dotted lines, have slopes −0.19 and −0.18, respectively. The Hurst parameter estimate is H = 0.90. 


The fitting of the PBF distribution, performed by the procedure of Digression 9.B, is shown in 
Figure 10.3. It can be seen that this fitting is quite satisfactory for the body and tails of the 
distribution, even though it was made with focus on T > 1 year tails. The optimal parameter ξ is 
0 and thus the PBF distribution in this case switches to the Weibull distribution. 
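
The switch from the PBF to the Weibull distribution as the tail index tends to zero can be verified numerically. The sketch below assumes the PBF distribution function in the form F(x) = 1 − (1 + ζξ((x − x_L)/λ)^ζ)^(−1/(ζξ)), which is not restated in this section and should be checked against the definition given earlier in the book. The function names are hypothetical, and the parameter values are those reported for the Hoofdplaat fit.

```python
import math


def pbf_cdf(x, xi, zeta, lam, x_low=0.0):
    """PBF distribution function (assumed form, see lead-in) for tail index xi > 0."""
    if x <= x_low:
        return 0.0
    u = ((x - x_low) / lam) ** zeta
    # 1 - (1 + zeta*xi*u)^(-1/(zeta*xi)), written with log1p/expm1 for numerical stability
    return -math.expm1(-math.log1p(zeta * xi * u) / (zeta * xi))


def weibull_cdf(x, zeta, lam):
    """Weibull distribution function, 1 - exp(-(x/lam)^zeta)."""
    if x <= 0.0:
        return 0.0
    return -math.expm1(-((x / lam) ** zeta))


if __name__ == "__main__":
    zeta, lam = 2.28, 7.53  # fitted values for Hoofdplaat (lam in m/s)
    for x in (2.0, 5.0, 10.0, 20.0):
        f_pbf = pbf_cdf(x, xi=1e-8, zeta=zeta, lam=lam)
        f_wei = weibull_cdf(x, zeta, lam)
        print(f"x = {x:5.1f} m/s  PBF (xi -> 0): {f_pbf:.6f}  Weibull: {f_wei:.6f}")
```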

Figure 10.3 Comparison of empirical and theoretical PBF distribution fitted on daily wind speed data of 
the Hoofdplaat station, Netherlands. The fitting was done based on noncentral and tail K-moments using the 
linear approximation (denoted in the figure “K-moments 1”). For comparison the nonlinear approximation 
(“K-moments 2”) is also shown but it is indistinguishable from the former case. Furthermore, empirical 
return periods assigned from order statistics are also shown. The points for K-moments have been adapted 
for the bias effect of time dependence but the points based on order statistics have not. The distribution 
function is depicted in terms of return period T, either on its entire range (left) or focusing on values 
greater than 1 year (right). The parameters are: ξ = 0, ζ = 2.28, λ = 7.53 m/s, x_L = 0 and were fitted with 
focus on T > 1 year. 


Digression 10.B: An example of fitting a PBF distribution on mean hourly 
wind speed 


Hourly or sub-hourly wind data are rarer than daily in publicly available databases. However, they 
are quite useful for operational use in diverse tasks such as simulation of wind energy generation, 
design of wind turbines and estimation of wind load on buildings. Here we use one of the longest 
hourly data sets, that of the MIT station in Boston, MA, USA (42.367°N, 71.033°W, 9.0 m). 

Figure 10.4 Monthly plot of the hourly wind speed time series at the MIT station in Boston, MA, USA. For 
each month, the average, the minimum and the maximum hourly values are plotted. 


The data are available from the NOAA system but a great deal of effort was needed to convert 
them to hourly time series, which was undertaken in the study by Dimitriadis and Koutsoyiannis 
(2018). They cover, with minor gaps, the period January 1945 to December 2014 (70 calendar 
years with a few missing values; hourly values n = 589 551, 1% of which are zero). Characteristic 
plots of the time series are given in Figure 10.4, while the climacograms are shown in Figure 10.5. 

Figure 10.5 Empirical σ-climacogram and K₂-climacogram from the hourly wind speed time series at the MIT 
station in Boston, MA, USA. The lines have been constructed from the daily series and the points from the 
annual series. The power laws, fitted by regression on the points of the annual series and plotted as dotted 
lines, have slopes −0.12 and −0.09, respectively. The Hurst parameter estimate is H = 0.92. 


The fitting of the PBF distribution, performed by the procedure of Digression 9.B, is shown in 
Figure 10.6. It can be seen that this fitting is quite satisfactory for the body and tails of the 
distribution, even though it was made with focus on T > 1 year tails. In this case the tail index is 
substantially different from 0 (0.12 to 0.15, depending on the focus of the fitting, similar to the 
typical values in rainfall). 

Figure 10.6 Comparison of empirical and theoretical PBF distribution fitted on the hourly wind speed time 
series at the MIT station in Boston, MA, USA. The distribution function is depicted in terms of return period 
T, either on its entire range (left) or focusing on values greater than 1 year (right). The parameters are: 
ξ = 0.122, ζ = 3.08, λ = 5.71 m/s, x_L = 0 and were fitted on the entire domain of wind speed (a fit with 
focus on T > 1 year increases the tail index to ξ = 0.148). For further explanations see caption of Figure 10.3. 


In the above analysis we have used all values in the time series, including the zeros. If we 
exclude the zero values (n₀ = 6086), we can also study the resulting lower tail, using the n = 
583 465 nonzero values. To visualize both tails in a single graph we use a probability plot in terms 
of the excess return period (see Digression 5.A). This is seen in Figure 10.7 where the fitting of 
the PBF distribution is excellent (compare the theoretical curve with the empirical ones based on 
K-moments), spanning 13 orders of magnitude of excess return period. 

Figure 10.7 Comparison of empirical and theoretical PBF distribution fitted on the hourly wind speed time 
series at the MIT station in Boston, MA, USA, excluding the zero values. The distribution function is depicted 
in terms of the excess return period T − D, on its entire range. The parameters are: ξ = 0.120, ζ = 2.73, λ = 
5.35 m/s, x_L = 0.15 m/s and were fitted on the entire domain of wind speed. The fitting was done based on 
noncentral and tail K-moments using the linear approximation (denoted in the figure “K-moments 1”). For 
comparison the nonlinear approximation (“K-moments 2”) is also shown but it is indistinguishable from 
the former case. Furthermore, empirical return periods assigned from order statistics are also shown. The 
points for K-moments have been adapted for the bias effect of time dependence but the points based on 
order statistics have not. 


10.2 Temperature 


The study of temperature has been a very hot topic due to its direct relationship with “global 
warming”, “climate change” and other similar expressions. Certainly, temperature is 
connected to the concentration of carbon dioxide in the atmosphere. Despite another 
heralded expression, “science is settled”, the nature of this connection remains unclear: in a 
recent study Koutsoyiannis and Kundzewicz (2020) showed that the direction of causality 
is not clear. They suggested that the relationship of atmospheric CO₂ and 
temperature (T) may qualify as belonging to the category of “hen-or-egg” problems, 
where it is not always clear which of two interrelated processes is the cause and which 
the effect. Examining modern data (1980-2019) of T and CO₂ concentration, they 
concluded that, while both causality directions exist, the dominant direction is T → CO₂ 
and not CO₂ → T as commonly perceived. Paleoclimatic data are even more categorical 
that changes in temperature precede those in CO₂. 

At the global level, the temperature is meaningfully influenced by the rhythm of the 
major ocean-atmosphere fluctuations, such as the ENSO and IPO in the Pacific as well as 
the AMO in the Atlantic (Kundzewicz et al., 2020). At the local level, the global behaviour 
certainly plays a role in local changes, but other factors, most profoundly urbanization, 
may also influence the temperature substantially. This is exemplified in Digression 10.C, 
referring to the longest instrumental temperature record in Milano, Italy. The behaviours 
seen in that record are characteristic for temperature time series. Among these are the 
high persistence and the light-tailed distributions close to normal, which entail much 
lower variability than in other variables such as rainfall, runoff and wind. 


Digression 10.C: Study of the longest instrumental temperature record 
(Milano, Italy) 


According to both the Global Historical Climatology Network (GHCN) - Daily and the European 
Climate Assessment & Dataset project (ECAD), the meteorological station with the longest 
temperature record is that of Milan (Milano) in Italy (45.47°N, 9.19°E, 150.0 m). The 
measurements started in 1763 and the data are available up to 2008 from KNMI (and other 
databases). They cover, with minor gaps, the period January 1763 to November 2008 (246 years 
with a few missing values; daily values n = 89 686 for daily maximum temperatures and n = 
89 697 for daily minimum temperatures). Here we study both the daily maximum and the daily 
minimum temperatures, which are typically original (raw) measurements. In contrast, the more 
commonly used mean daily temperatures are processed data and thus subject to changes in 
processing methodologies. 

For a better understanding of the conditions in the wider area and also for seeing the evolution 
after 2008, we also study two adjacent stations with long records. The first is in Lugano, 
Switzerland (46.00°N, 8.97°E, 273.0 m, ~60 km from Milano), a small town with a population of 
55 359 in 2018,¹ which has not changed substantially in the last 50 years, while the geomorphology 
of the area does not favour urban expansion. The measurements started in 1901 and the 
data are available to date from KNMI (and other databases). They cover, with minor gaps, the 
period January 1901 to June 2020 (more than 119 years with a few missing values; total values n 
= 43 573 for daily maximum temperatures and n = 43 585 for daily minimum temperatures). 

The second is in Monte Cimone, Italy (44.20°N, 10.70°E, 2165.0 m, ~185 km from Milano), a 
mountainous area with no settlements. The measurements started in 1950 and the data are 
available to date from KNMI (and other databases). They cover, with minor gaps, the period 
January 1951 to November 2018 (67 years, in some of which there are missing values; total values 
n = 24 283 for daily maximum temperatures and n = 24 263 for daily minimum temperatures). 

The daily time series for Milano and Lugano are plotted in Figure 10.9, along with the running, 
on 10-year windows, maximum and minimum values for return period of 2 years. The latter were 
estimated on the basis of the K-moment for the appropriate moment order which corresponds to 
return period of 2 years. A separate plot is included for the running maxima and minima of 2-year 
return period for all three stations. All plots indicate upward and downward fluctuations with the 
upward ones prevailing for the minimum temperature in the period 1940 to today. In the same 
period, the maximum temperature in Milano is also increasing. However, by comparing the 
temporal evolution of maxima in Milano for the 2-year return period and those of the two nearby 
stations, Lugano and Monte Cimone, in which there is no increasing trend, it appears that 
urbanization might be the principal factor causing temperature increase, rather than global 
effects. On the other hand, the increase of the minimum temperatures in the recent decades seems 
to be a more general behaviour (Glynis, 2020). Apparently, this is a favourable behaviour as it 
reduces the incidence and impacts of extreme cold (see section 11.3). 

A typical “modern” interpretation of the situation would be to attribute the upward segments 
to global warming, which in turn would be attributed to human CO2 emissions, etc., and eventually 
to rely on climate models for future projections. However, here we prefer to investigate the 
stochastic properties of the time series and try to build a consistent stationary stochastic 
representation with long-term persistence. Indeed, as shown in Figure 10.10, the σ-climacograms 
and the K₂-climacograms of all series suggest high values of the Hurst parameter, estimated from 
the annual series, between H = 0.91 and H = 0.94. Even the Lugano series of maximum daily 
temperature, which does not show a warming trend, suggests H = 0.91. 
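
As a rough illustration of how a Hurst parameter can be read off a climacogram, the following minimal Python sketch computes an empirical climacogram (the variance of the time-averaged series as a function of the averaging scale k) and fits a power law with exponent 2H − 2 by least squares in log-log space. It is not the estimation procedure behind Figure 10.10: the function names are hypothetical, the classical variance estimator is used instead of K-moments, and no correction is made for estimation bias, which under strong persistence tends to pull the estimate of H downwards.

```python
import numpy as np


def climacogram(x, max_scale=None, n_scales=20):
    """Return scales k and the sample variance of the series averaged at each scale."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if max_scale is None:
        max_scale = max(n // 10, 1)  # keep roughly 10 or more averaged values per scale
    # Roughly log-spaced integer scales, so that the log-log fit is not
    # dominated by the largest (and noisiest) scales.
    scales = np.unique(np.round(np.geomspace(1, max_scale, n_scales)).astype(int))
    variances = []
    for k in scales:
        m = n // k  # number of complete, non-overlapping windows
        means = x[: m * k].reshape(m, k).mean(axis=1)
        variances.append(means.var(ddof=1))
    return scales, np.array(variances)


def hurst_from_climacogram(x, max_scale=None):
    """Least-squares slope of ln(variance) against ln(scale); slope = 2H - 2."""
    k, gamma = climacogram(x, max_scale)
    slope, _intercept = np.polyfit(np.log(k), np.log(gamma), 1)
    return 1.0 + slope / 2.0


# Usage with synthetic white noise, for which H should be close to 0.5.
rng = np.random.default_rng(1)
print(round(hurst_from_climacogram(rng.standard_normal(20_000), max_scale=100), 2))
```

For a real annual series of about a century the number of usable scales is small, so an estimate obtained in this way is only indicative.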
Figure 10.8 (upper) Location of three meteorological stations (source: Google Earth). (middle and lower) 
Depiction of urbanization in Milano in 1988 (population 3 506 838, urban extent 88 417 ha) and in 2013, 
respectively (population 6 402 051, urban extent 277 177 ha) (source: Glynis, 2019, from data provided by 
the Atlas of Urban Expansion2). 





Notably, the oscillations in the climacograms constructed from the daily series in Figure 10.10 
reflect the strong annual periodicity of temperature (cf. Koutsoyiannis, 2017). Yet, these have not 
influenced the estimation of H as this has been based on the climacograms of the annual series. 
Figure 10.9 (upper) Time series of daily maximum (continuous lines) and minimum (dashed lines) 
temperature of Milano, along with the running, on 10-year windows, maximum and minimum values for 
return period of 2 years. (middle) As upper but for Lugano. (lower) Running, on 10-year windows, 
maximum and minimum values for return period of 2 years, for all three stations. For comparability, the 
temperatures of Monte Cimone were shifted up by 13 °C because of the altitude difference of 2 km, assuming 
a temperature gradient of 6.5 °C/km. 
Figure 10.10 Empirical σ-climacogram (single lines) and K₂-climacogram (double lines) from daily 
maximum and minimum temperatures at Milano, Italy, and Lugano, Switzerland. The lines have been 
constructed from the daily series and the points of the same colour from the annual series. The points 
corresponding to the annual series suggest power laws with slopes as indicated. The Hurst parameters, 
estimated from the annual series, are H = 0.93 for both series of Milano, H = 0.91 for the maximum 
temperature at Lugano and H = 0.94 for the minimum temperature at Lugano. 


The fitting of the Weibull distribution on the daily maximum temperatures of Milano, 
performed by the procedure described in Digression 9.B, is shown in Figure 10.11. The fitting is 
not ideal for the entire range of temperature (left panel) but if we focus on the upper tail, for T > 
1 year, the fitting becomes perfect (right panel). 
Figure 10.11 Comparison of empirical and theoretical Weibull distribution fitted on the daily maximum 
temperatures at Milano, Italy. The distribution function is depicted in terms of excess return period T − D, 
either on their entire range (left) or focusing on those greater than 1 year (right). The parameters are: ζ = 
5.11, λ = 35.20 °C, x, = −15.37 °C for the left panel and ζ = 6.97, λ = 40.23 °C, x, = −15.26 °C for the 
right panel. For further explanations see caption of Figure 10.3. 
Figure 10.12 Comparison of empirical and theoretical Weibull and normal distributions fitted on the daily 
maximum temperatures in August at Milano, Italy. The distribution function is depicted in terms of excess 
return period T − D. The parameters of the Weibull distribution are: ζ = 3.87, λ = 15.79 °C, x, = 
13.71 °C and those of the fitted normal distribution μ = 27.86 °C, σ = 3.83 °C (the sample statistics are μ̂ = 
28.13 °C, σ̂ = 3.33 °C). For further explanations see caption of Figure 10.3. 


The fitted Weibull distribution is in fact very close to the normal. This is illustrated in Figure 
10.12, which is similar to Figure 10.11 (left) but refers only to the temperatures of the 
hottest month, August. In Figure 10.12 both the Weibull and the normal distributions have 
been fitted and they are very close to each other. 
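
The closeness to the normal distribution can be checked roughly from the parameters quoted in the caption of Figure 10.12. The Python sketch below assumes the usual three-parameter Weibull form F(x) = 1 − exp(−((x − c)/λ)^ζ), with c a location (lower bound) parameter; this parameterization and the function name are assumptions made here for illustration and are not necessarily those of Digression 9.B. Under that assumption, the mean and standard deviation follow from gamma functions and can be compared with the values of the fitted normal distribution.

```python
from math import gamma, sqrt


def weibull_mean_std(shape, scale, loc=0.0):
    """Mean and standard deviation of a three-parameter Weibull distribution
    with distribution function F(x) = 1 - exp(-((x - loc) / scale) ** shape)."""
    g1 = gamma(1.0 + 1.0 / shape)
    g2 = gamma(1.0 + 2.0 / shape)
    mean = loc + scale * g1
    std = scale * sqrt(g2 - g1 ** 2)
    return mean, std


# Parameters quoted for the August daily maxima at Milano (Figure 10.12).
mean, std = weibull_mean_std(shape=3.87, scale=15.79, loc=13.71)
print(f"Weibull mean = {mean:.2f} °C, standard deviation = {std:.2f} °C")
```

Under these assumptions the Weibull mean comes out near 28 °C, close to the normal and sample means quoted in the caption (27.86 °C and 28.13 °C).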
Figure 10.13 Comparison of empirical and theoretical Weibull distribution fitted on the daily minimum 
temperatures at Milano, Italy. The distribution function is depicted in terms of excess return period T − D, 
on their entire range (left) and in terms of the excess return period of minima T − D, when focusing on T > 
1 year (right). The parameters are: ζ = 5.80, λ = 36.28 °C, x, = −23.45 °C for the left panel and ζ = 
12.93, λ = 39.06 °C, x, = −34.43 °C for the right panel. For further explanations see caption of Figure 10.3. 


In turn, the fitting of the Weibull distribution on the daily minimum temperatures of Milano is 
shown in Figure 10.13. Here again the distribution shape is close to normal. The fitting for the 
entire range of temperature (left panel) is better than that of maximum temperature (Figure 
10.11, left). If we focus on the tail of interest, i.e., on minima with T > 1 year, the fitting again 
becomes perfect (right panel). 





1 https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=urb_cpopcb&lang=en 
2 http://atlasofurbanexpansion.org/ 


10.3 Dew point 


When studying the extreme events related to the water cycle, a variable more useful than 
atmospheric temperature is the dew point, defined to be the temperature at which the air 
must be cooled to become saturated with water vapour. The dew point is measured in the 
same units as temperature, and depends on the temperature on the one hand and on the 
presence of atmospheric moisture on the other hand. The relationship of these quantities 
is provided by the Clausius-Clapeyron equation, i.e., the law determining the equilibrium 
of liquid and gaseous phase of water, which maps temperatures to saturation vapour 
pressures. This law in essence describes an entropy-maximizing state, that is, a state 
where the uncertainty at a microscopic level becomes maximum, interestingly yielding a 
virtually deterministic law at the macroscopic level. While probability is typically used for 
inductive reasoning, utilising data and statistics, this law serves as an example in which 
probability can also be used for deductive reasoning, with the impressive result described 
in Digression 10.D. 

The dew point influences both the evaporation rate, which increases with an increasing 
departure thereof from temperature, and storm intensity, which may increase with a 
larger dew point. It is natural to expect that, since temperature has been increasing in 
recent decades, the dew point should have increased too. However, global reanalysis data 
(specifically from the ERA5 reanalysis*) show a slower increase in the dew point 
(Koutsoyiannis, 2020b). Interesting related information about the zonal variation of the 
increase of temperature and dew point is provided by Figure 10.14, which depicts the 
difference of the Earth’s temperature and dew point from their averages in the period 
1980-99. A positive difference corresponds to an increase after 1999. It is important to 
note that the greater increases are located in the northern polar area. In the tropical zone, 
which is hydrologically most important as the main source of evaporated water, the 
temperature increase is half the global average, while there is no increase at all in the dew 
point. The latter point is of highest hydrological significance. 

While this information is useful for global and zonal studies, local studies should be 
based on local data. Their analysis is not different from that of temperature and the 
general behaviours reported for rainfall in section 10.2 are generally valid also for the 
dew point. This is illustrated in Digression 10.E. 





* The ERA5 (Copernicus Climate Change Service, 2017) is the fifth-generation atmospheric reanalysis of the 
European Centre for Medium-Range Weather Forecasts (ECMWF), where the name ERA refers to ECMWF 
ReAnalysis. It spans the modern observing period from 1979 onward, with daily updates continuing forward 
in time, with fields available at a horizontal resolution of 31 km on 137 levels, from the surface up to 0.01 
hPa (around 80 km). It has been produced as an operational service and its fields compare well with the 
ECMWF operational analyses. 
Figure 10.14 Zonal distribution of the difference of the Earth’s temperature and dew point from 
their averages in the period 1980-99. Note that the graph represents averages for the entire 40+ 
year period, rather than differences between two periods (the latter are about twice the former). 
(Source: Koutsoyiannis, 2020b.) 


Digression 10.D: How entropy maximization at a microscopic level results 
in a macroscopic deterministic law 


Koutsoyiannis (2014a) has highlighted the probabilistic nature of the law that determines the 
equilibrium of liquid and gaseous phase of water by deriving it purely by maximizing probabilistic 
entropy, i.e. uncertainty. In particular, the law was derived by studying a single molecule (Figure 
10.15) and maximizing the combined uncertainty of its state related to: 


(a) its phase (whether gaseous, denoted as A, or liquid, denoted as B); 

(b) its position in space; and 

(c) its kinetic state, i.e., its velocity and other coordinates corresponding to its degrees of 
freedom and making up its thermal energy. 


The partial entropies of the two phases, i.e., the entropies conditional on the particle being in 
the gaseous (A) or liquid (B) phase, are: 


$\varphi_A = c_A + (\beta_A/2) \ln \varepsilon_A + \ln V_A, \qquad \varphi_B = c_B + (\beta_B/2) \ln \varepsilon_B + \ln V_B$ (10.1) 


with $c_i$ ($i$ = A, B) denoting a constant (incorporating several physical and mathematical 
constants), $\beta_i$ the degrees of freedom of a water molecule, $\varepsilon_i$ the (thermal) energy of the water 
molecule and $V_i$ the volume available for the motion of the water molecule in the specified phase. 
As the water molecule has a 3-dimensional (not linear) structure, the rotational energy is 
distributed into three directions, so that the total number of degrees of freedom (translational 
and rotational) is $\beta_A = 6$. The number of degrees of freedom in the liquid phase is greater than 6 
because of the “social behaviour” of water molecules. Specifically, in addition to the translational 
and rotational degrees of freedom of individual molecules, there are local clusters with low energy 
vibrational modes that can be thermally excited. The average number of degrees of freedom per 
molecule (individual and collective, involving more than one water molecule) is very high, $\beta_B = 18$. 
The total entropy is: 


$\varphi = \pi_A \varphi_A + \pi_B \varphi_B + \varphi_\pi$ (10.2) 


where $\pi_i$ is the probability that the molecule is in phase $i$, with corresponding entropy: 

$\varphi_\pi = -\pi_A \ln \pi_A - \pi_B \ln \pi_B$ (10.3) 


Thus, the total entropy can be written as: 


$\varphi = \pi_A (\varphi_A - \ln \pi_A) + \pi_B (\varphi_B - \ln \pi_B)$ (10.4) 

The two phases are in open interaction and the constraints are: 

$\pi_A + \pi_B = 1, \qquad \pi_A \varepsilon_A + \pi_B (\varepsilon_B - \xi) = \varepsilon$ (10.5) 


where $\xi$ is the amount of energy required for a molecule to move from the liquid to the gaseous phase 
(i.e. to break its bonds with other molecules, the phase change energy). 
Figure 10.15 Explanatory sketch indicating basic quantities involved in the equilibrium of the water 
vapour with liquid water, with zoom on a single molecule which “tries to hide itself” by maximizing the 
combined uncertainty related to its phase (being either gaseous or liquid with probabilities $\pi_A$ and $\pi_B$, 
respectively), position and kinetic state. 


We define the natural temperature, $\theta$, which has units of energy (joules) rather than 
temperature (kelvins), in accordance with the probabilistic principle that entropy is a dimensionless 
quantity $\varphi$, as: 

$\frac{1}{\theta} := \frac{\partial \varphi}{\partial \varepsilon}$ (10.6) 

Denoting $e$ the partial pressure of the $N_A$ water molecules being in the gaseous phase and 
maximizing the entropy in that phase, we obtain the law of ideal gases in the form (Koutsoyiannis, 
2014a): 

$e = \frac{N_A \theta}{V_A} = \frac{\theta}{v} \iff e v = \theta$ (10.7) 

where $v := V_A/N_A$. 

Furthermore, by maximizing the combined entropy of the two phases, as given in equation 
(10.4), we obtain the law of the equilibrium of the two phases as (Koutsoyiannis, 2014a): 


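$e = e_0 \exp\left(\frac{\xi}{\theta_0}\left(1 - \frac{\theta_0}{\theta}\right)\right) \left(\frac{\theta_0}{\theta}\right)^{\beta_B/2 - \beta_A/2 - 1}$ (10.8) 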
where ($\theta_0$, $e_0$) are the coordinates of the triple point of water (specifically, $\theta_0$ = 3.7714 zJ 
corresponding to $T_0$ = 273.16 K, $e_0$ = 6.11657 hPa). 
Figure 10.16 (upper) Comparison of saturation vapour pressure obtained by the proposed equation in 
either of the forms (10.8) or (10.10) and by a standard equation of the literature, namely, 
$e = e_0 \exp(19.84(1 - T_0/T))$. (lower) Comparison of relative differences of the saturation vapour pressure 
obtained by the proposed and the standard equations with accurate measurement data of different origins, 
as indicated in the legend and detailed in Koutsoyiannis (2012). 


The same law (also known as Clausius-Clapeyron equation in integrated form) can be written 
in more customary notation, in terms of absolute temperature in kelvins and using macroscopic 
quantities, as (Koutsoyiannis, 2012): 


$e = e_0 \exp\left(\frac{\alpha}{R T_0}\left(1 - \frac{T_0}{T}\right)\right) \left(\frac{T_0}{T}\right)^{(c_L - c_p)/R}$ (10.9) 


where ($T_0$, $e_0$) are again the coordinates of the triple point of water, $R$ is the specific gas constant 
of water vapour ($R$ = 461.5 J kg⁻¹ K⁻¹), $\alpha := \xi R/k = \xi N_a$ (with $k$ the Boltzmann constant and $N_a$ the 
Avogadro constant), $c_p$ is the specific heat at constant pressure of the vapour and $c_L$ is the specific 
heat of the liquid water. By substitution of the various constants, we end up with the following 
form of the equation (Koutsoyiannis, 2012): 

$e := e(T) = e_0 \exp\left(24.921\left(1 - \frac{T_0}{T}\right)\right) \left(\frac{T_0}{T}\right)^{5.06}$ (10.10) 


This form is both convenient and accurate (more accurate than other customary forms, 
theoretical or empirical, as illustrated in Figure 10.16). 
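
Equation (10.10), read here as e(T) = e₀ exp(24.921 (1 − T₀/T)) (T₀/T)^5.06, is straightforward to evaluate. The short Python sketch below implements it together with the standard form quoted in the caption of Figure 10.16, e = e₀ exp(19.84 (1 − T₀/T)), so that the two can be compared; the function names are hypothetical and the temperatures in the loop are arbitrary illustrative values.

```python
from math import exp

T0 = 273.16   # triple-point temperature (K), as given in the text
E0 = 6.11657  # triple-point vapour pressure (hPa), as given in the text


def saturation_vapour_pressure(T):
    """Equation (10.10): saturation vapour pressure (hPa) for T in kelvins."""
    return E0 * exp(24.921 * (1.0 - T0 / T)) * (T0 / T) ** 5.06


def saturation_vapour_pressure_standard(T):
    """The 'standard' form quoted in the caption of Figure 10.16 (hPa)."""
    return E0 * exp(19.84 * (1.0 - T0 / T))


# Tabulate both forms for a few temperatures (degrees Celsius).
for t_celsius in (-20.0, 0.0, 20.0, 40.0):
    T = t_celsius + 273.15
    print(t_celsius,
          round(saturation_vapour_pressure(T), 3),
          round(saturation_vapour_pressure_standard(T), 3))
```

Both forms return e₀ at the triple point by construction; they drift apart as the temperature moves away from T₀.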

A state in which the vapour pressure $e_a$ is lower than the saturation pressure $e(T)$ is 
characterized by the relative humidity: 

$U := \frac{e_a}{e(T)} = \frac{e(T_d)}{e(T)}$ (10.11) 

which serves as a formal definition of both the relative humidity $U$ and the dew point $T_d$. In 
particular, the dew point is calculated by the following equation, a direct result of (10.11): 

$T_d = e^{-1}(U e(T))$ (10.12) 

where $e^{-1}(\,)$ is the inverse function of $e(\,)$; its numerical handling is discussed in Koutsoyiannis 
(2012). As in equilibrium the maximum $U$ is 1, it follows that $T$ is an upper bound of $T_d$. 
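
The inversion in equation (10.12) has no closed form for the saturation curve (10.10), but because that curve increases with temperature over the atmospheric range it can be inverted numerically. The Python sketch below uses plain bisection, which is a generic choice made here for illustration and not necessarily the numerical treatment of Koutsoyiannis (2012); the function names and the lower bracket of 150 K are assumptions of this sketch.

```python
from math import exp

T0, E0 = 273.16, 6.11657  # triple point of water (K, hPa), as given in the text


def e_sat(T):
    """Saturation vapour pressure (hPa), equation (10.10), T in kelvins."""
    return E0 * exp(24.921 * (1.0 - T0 / T)) * (T0 / T) ** 5.06


def dew_point(T, U, T_low=150.0, tol=1e-6):
    """Solve e_sat(Td) = U * e_sat(T) for Td by bisection on [T_low, T].

    Relies on e_sat being increasing in T over the bracket; T_low is an
    arbitrary lower bracket chosen for this sketch.
    """
    if not 0.0 < U <= 1.0:
        raise ValueError("relative humidity U must be in (0, 1]")
    target = U * e_sat(T)
    if e_sat(T_low) > target:
        raise ValueError("dew point lies below the lower bracket T_low")
    lo, hi = T_low, T
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if e_sat(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example: air at 20 °C with 50% relative humidity.
Td = dew_point(293.15, 0.5)
print(round(Td - 273.15, 1), "°C")
```

For U = 1 the routine returns the air temperature itself (to within the tolerance), which reflects the statement that the temperature bounds the dew point from above.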


Digression 10.E: An example of fitting a distribution on daily dew point data 


The dew point can be derived by equation (10.12) from measurements of temperature and 
relative humidity, but can also be measured by devices called hygrometers. Data from raw 
measurements are not widely available, yet the KNMI database provides data of daily maximum 
dew points in 35 stations in the Netherlands. The station with the longest record is De Bilt 
(52.101°N, 5.177°E, 2.0 m). Its data cover the period January 1951 to May 2018 (66 calendar years 
with the entire 1960 missing and a few missing values in other years). The number of daily values 
is n = 24 561, 2.6% of which are below 1 °C, while no negative values are contained. 
Characteristic plots of the time series are given in Figure 10.17. 
Figure 10.17 Time series of daily maximum dew point at De Bilt, Netherlands, along with the running, on 
10-year windows, average and maximum value for return period of 2 years. 


According to equation (10.12) and Figure 10.16, negative values would have been expected, 
but they cannot be directly measured. Therefore, the smallest values (< 1 °C), while 
kept in the climacogram analysis, were not used in the distribution fitting. 

Focusing on the running maximum for return period of 2 years, which is also plotted in Figure 
10.17, we may observe some fluctuation with a decreasing trend before 1990 and an increasing 
one thereafter. One may thus speculate that global warming caused a temperature increase in 
De Bilt, which in turn shifted the dew point upwards because of its positive correlation with temperature. 
On the other hand, one should not forget the possible effect of urbanization on temperature, also 
having in mind the fact that the Netherlands constitutes one of the most urbanized areas in Europe 
and in the world (Figure 10.18). 





Figure 10.18 Depiction of urbanization in Europe by comparing satellite images of night lights of (upper) 
2000 and (lower) 2012. Image source: https://www.nightearth.com/. The 2000 image was created by 
NASA using data from the Defense Meteorological Satellite Program (DMSP)'s Operational Linescan System 
(OLS), originally designed to view clouds by moonlight. The 2012 image was captured by NASA using the 
Suomi National Polar-orbiting Partnership (Suomi NPP) satellite during April and October 2012 (from the 
“day-night band” of the Visible Infrared Imaging Radiometer Suite—VIIRS, which detects light in a range of 
wavelengths from green to near-infrared). 
Therefore, avoiding attribution attempts, here we merely investigate the stochastic properties 
of the time series and try to build a consistent stationary stochastic representation with 
persistence. Indeed, as shown in Figure 10.19, the climacograms suggest a high value of the Hurst 
parameter, estimated from the annual series at H = 0.89. 
Figure 10.19 Empirical σ-climacogram and K₂-climacogram of daily maximum dew point at De Bilt, 
Netherlands. The lines have been constructed from the daily series and the points from the annual series. 
The points corresponding to the annual series suggest a power law with exponent -0.14. The Hurst 
parameter, estimated from the annual series, is H = 0.89. 


The fitting of the Weibull distribution on the daily maximum dew point, performed by the 
procedure described in Digression 9.B and shown in Figure 10.20, is good for the entire domain 
of dew point except for the values < 1 °C, for the reasons explained above. The high value of the 
shape parameter ζ (close to 10) suggests a distribution shape close to normal. 
Figure 10.20 Comparison of empirical and theoretical Weibull distribution fitted on the daily dew point at 
De Bilt, Netherlands. The distribution function is depicted in terms of the excess return period T − D, on its 
entire range (left) and focusing on T − D > 1 year (right). The parameters are: ζ = 9.98, λ = 45.42 °C, x, = 
−34.43 °C and were fitted on the entire domain of dew point except for the values < 1 °C. For further 
explanations see caption of Figure 10.3. 


Chapter 11. Epilogue: Technology for risk reduction 


As already mentioned (section 5.3), the notion of risk incorporates three factors: the 
probability of occurrence of a dangerous event, the exposure and the vulnerability (Kron 
et al., 2019). More formally, the risk is usually defined as the product of three variables: 


$R = H E V$ (11.1) 

which have the following meaning: 


• H is the hazard, i.e., the occurrence probability of a dangerous event (unit: 
dimensionless). 

• E denotes the exposure, i.e., the “values” that are exposed to a dangerous event. 
These values are usually expressed in monetary terms (unit: e.g., $, €, ¥) 
representing the economic value of the objects that are present at the location 
involved. In its severest form, E represents human lives (unit: dimensionless). 

• V is the vulnerability, i.e., the lack of resistance to damaging or destructive forces, 
expressed as a value between 0 and 1 (unit: dimensionless), with the highest value 
1 representing full damage of the exposed values. 


There is a single means to control any of these variables: technology. Considering as an 
example the flood risk at a specific location, we can reduce the hazard by several 
technological solutions, e.g., by building a dam upstream of that location; this, however, 
is not possible for other types of hazards, e.g. earthquakes. We can also reduce the 
exposure by modelling the flood extent, delineating flood-prone zones and then 
implementing urban planning to prohibit or discourage human settlements in those zones 
(this applies to all types of hazards). Finally, we can reduce vulnerability, e.g. by 
developing early warning systems (Di Baldassarre et al., 2010). 
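
To make the multiplicative structure of equation (11.1) concrete, the following Python sketch evaluates R = H E V for a flood example and shows how each type of measure lowers the risk through a different factor. All numbers and the function name are hypothetical and chosen only for illustration.

```python
def risk(hazard, exposure, vulnerability):
    """Risk as the product R = H * E * V of equation (11.1)."""
    if not 0.0 <= hazard <= 1.0:
        raise ValueError("hazard is a probability and must lie in [0, 1]")
    if not 0.0 <= vulnerability <= 1.0:
        raise ValueError("vulnerability must lie in [0, 1]")
    return hazard * exposure * vulnerability


# Hypothetical flood example (all numbers invented for illustration):
# annual flooding probability 1%, exposed assets worth 50 million (any
# currency), and 40% of the exposed value lost if the flood occurs.
baseline = risk(0.01, 50e6, 0.4)

with_dam = risk(0.002, 50e6, 0.4)      # upstream dam: smaller H
with_zoning = risk(0.01, 20e6, 0.4)    # fewer assets in flood-prone zones: smaller E
with_warning = risk(0.01, 50e6, 0.25)  # early warning system: smaller V

for name, value in [("baseline", baseline), ("dam", with_dam),
                    ("zoning", with_zoning), ("early warning", with_warning)]:
    print(f"{name:>13}: expected annual loss = {value:,.0f}")
```

Because the three factors multiply, halving any one of them halves the risk, whichever factor a given measure happens to act upon.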

Risk is an objective quantity that should be distinguished from its perception. The 
latter is determined by other factors or interests: political, economic or social. For 
example, as a result of intensification of reporting of occurring disasters and projections 
of future catastrophes, people think that the risk from hydroclimatic extremes has been 
radically increasing. As we will see in the next subsections, this is just a social perception, 
in fact opposite to reality. 


11.1 Is hydroclimatic hazard increasing? 


A first indication supporting a negative reply to the question in the subsection’s title is 
provided by a list of world record point precipitation measurements compiled by 
Koutsoyiannis and Papalexiou (2017) for various time scales ranging from 1 min to 2 
years. As can be seen in Figure 11.1, reproduced from that study, the fact is that the 
highest frequency of record rainfall events occurred in the period 1960-80; later the 
frequency decreased markedly. 
Figure 11.1 World record point precipitation measurements (locations and time stamps of the 
events producing record rainfall) for time scales ranging from 1 min to 2 years, compiled in 
Koutsoyiannis and Papalexiou (2017). The time scales on which the different events have given 
the record rainfall are as follows: A - 1 month to 2 years; B - 20 min; C - 2.17 h; D - 5 min; E - 15 
min; F - 8 min; G - 3 min; H - 2.75 h; I - 3 h; J - 42 min; K - 1 d; L - 9 h; M - 18 h; N - 1 min; O - 2.5 h; 
P - 30 min; Q - 1 h; R - 2 h; S - 2 d; T - 6 h; U - 9 d to 15 d; V - 72 min; W - 2 d to 7 d. 


A more detailed analysis, again on a global basis, has been provided by Koutsoyiannis 
(2020a), based on reanalysis and satellite data of daily rainfall. Analyses of precipitation 
maxima have been made to test the allegation by IPCC (2013a) about an intensification of 
the hydrological cycle and the related extremes. Notably, the intensification claim, if 
quantified, turns out to be of the order of 1% to 5% (IPCC, 2013a; Koutsoyiannis, 2020b,c). 
Such percentages of change are negligible and hardly detectable, given the high 
variability of precipitation articulated in previous sections. Nonetheless, here we 
reproduce some of the results of Koutsoyiannis (2020b) for the incredulous reader. The 
analyses have been performed separately for each continent and their results are 
presented graphically. Figure 11.2 shows the temporal evolution of the monthly 
maximum daily precipitation areally averaged over the continents. None of the data 
sources, in any of the continents, supports the intensification claim. In particular, 
the observational data (CPC and GPCP) could support the opposite hypothesis, that of 
extreme rainfall deintensification. 
Figure 11.2 Variation of the monthly maximum daily precipitation areally averaged over the 
continents. Thin and thick lines of the same colour represent monthly values and running annual 
averages (right aligned), respectively. Dashed lines are for reanalyses and continuous lines for 
observations. (Source: Koutsoyiannis, 2020b; NCEP-NCAR: reanalysis data; ERA5: reanalysis data; 
CPC: unified gauge-based daily precipitation gridded over land; GPCP: precipitation data set 
combining gauge and satellite precipitation data over a global grid.) 
Figure 11.3 Variation of the standard deviation of daily precipitation in each month, areally 
averaged. Thin and thick lines of the same colour represent monthly values and running annual 
averages (right aligned), respectively. (Source: Koutsoyiannis, 2020b; GPCP: precipitation data set 
combining gauge and satellite precipitation data over a global grid; CPC: unified gauge-based daily 
precipitation gridded over land.) 
Deintensification becomes even more evident if we examine the temporal evolution of the 
standard deviation of daily precipitation in each month, averaged over land. In this 
respect, Figure 11.3 shows that deintensification, expressed as decreasing standard 
deviation, is evident in the 21st century both from CPC and GPCP observational data. This 
finding is consistent with earlier findings by Sun et al. (2012). A similar result is shown in 
a different manner in Figure 11.4 in terms of precipitation rate exceeding a threshold. 
Clearly, neither the frequency of high precipitation nor the sum of high intensity 
precipitation is intensifying. Rather, in most of the cases, there has been deintensification 
in the 21st century. Again, however, it would be more prudent to speak about fluctuations 
rather than deintensification. This is consistent with the general approach in this book to 
use stationary models (with appropriate dependence structure) for extremes, a 
suggestion also made in other works (Koutsoyiannis, 2003, 2006b, 2011a; Montanari and 
Koutsoyiannis, 2014; Koutsoyiannis and Montanari, 2015; De Luca et al., 2020). 
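
The indices discussed above (monthly maximum, standard deviation of daily precipitation within each month, number of days above a threshold, and the precipitation total on such days) are simple to compute from a daily series. The Python sketch below shows one possible implementation for a single station or grid cell. The function name, the dictionary layout and the toy data are hypothetical, and it is assumed here that the total of daily precipitation exceeding the threshold means the sum of the daily amounts on days above the threshold, rather than the sum of the excesses.

```python
import numpy as np


def monthly_extreme_indices(precip, month_id, threshold=10.0):
    """Per-month indices of a daily precipitation series (mm/d).

    precip   : daily precipitation values
    month_id : label of the month each day belongs to (same length as precip)
    """
    precip = np.asarray(precip, dtype=float)
    month_id = np.asarray(month_id)
    indices = {}
    for m in np.unique(month_id).tolist():
        p = precip[month_id == m]
        above = p > threshold
        indices[m] = {
            "max": float(p.max()),                   # monthly maximum daily value
            "std": float(p.std(ddof=1)),             # within-month standard deviation
            "days_above": int(above.sum()),          # days exceeding the threshold
            "total_above": float(p[above].sum()),    # total on those days
        }
    return indices


# Toy data: two "months" of three days each (values in mm/d).
toy = monthly_extreme_indices([0, 12, 25, 5, 0, 30], [0, 0, 0, 1, 1, 1], threshold=10.0)
for month, values in toy.items():
    print(month, values)
```

Applied to gridded data, the same calculation would be repeated for each cell and the results averaged areally before plotting their temporal evolution.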
Figure 11.4 (left column) Average days per month with precipitation exceeding a threshold 
value, which is 10 mm/d (upper row) and 20 mm/d (lower row); (right column) monthly total 
of daily precipitation exceeding the threshold value. Thin and thick lines of the same colour 
represent monthly values and running annual averages (right aligned), respectively. 
For examining the floods, we use the already mentioned (section 9.1) database of the 
US Geological Survey’s (USGS) National Water Information System and in particular the 
registered annual peaks by Hirsch and Ryberg (2012) in 200 stream gauges in the 
coterminous USA in pristine or near-pristine catchments, of at least 85 years length 
through water year 2008. Figure 11.5 depicts the frequency of a record high flood per 
decade, obtained from this database. The annual average (= 0.0109 events per year) and 
the confidence limits on decadal basis are also plotted in the figure. Fluctuations are 
evident in the time evolution, with low flood occurrences in the 1960s and high in the 
1900s and 1990s, with highest ones in the 1890s. Only in the 1890s was the frequency of 
record high floods higher than the upper confidence limit. Even this very old event would 
not be an issue of concern: first because an exceedance from the 95% confidence area in 
one out of 13 decades is not unnatural and second because if we considered the 
dependence structure and performed Monte Carlo simulations to determine the 
confidence limits, the confidence zones would be wider. 
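
The two numerical points made above can be verified with a few lines of Python. The sketch first reproduces the average record-high probability from the counts quoted for the database, and then evaluates how likely it is that at least one of 13 decades falls outside 95% confidence limits. The second calculation assumes, for illustration only, that the decades are independent and that each has a 5% chance of falling outside the limits.

```python
n_records = 205          # record highs in the database (Figure 11.5)
n_station_years = 18846  # station years in the database (Figure 11.5)
p_record = n_records / n_station_years
print(f"average probability of a record high per station year: {p_record:.4f}")

# Assumption for this sketch: the decades are independent and each has a 5%
# chance of falling outside two-sided 95% confidence limits.
n_decades = 13
p_at_least_one = 1.0 - 0.95 ** n_decades
print(f"P(at least one of {n_decades} decades outside the limits) = {p_at_least_one:.2f}")
```

Under these assumptions the chance of at least one such exceedance in 13 decades is about 49%, so a single exceedance is indeed unremarkable.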
Figure 11.5 Frequency of a record high flood per decade, obtained from the database by Hirsch 
and Ryberg (2012) from 200 selected stream gauges in pristine or near-pristine catchments in the 
coterminous USA. The database contains 18 846 station years and the record highs are 205 (> 200 
because of some ties) and thus the average probability of a record high is 205/18 846 = 0.0109. 
The confidence limits are approximate, constructed for confidence level of 95% on the basis of 
independence (Papoulis, 1990, p. 284). 


About the possible intensification of the wind field over the globe, some information is 
provided in Figure 11.6 in terms of the global maximum wind speed, zonal and meridional. 
The plots do not show any noteworthy change (e.g. trend). Only slight fluctuations appear. 
Thus, the regime shown does not justify intensification of wind or of precipitation 
extremes that the latter could induce. 
Figure 11.6 Variation of the monthly maximum daily wind speed, in each of the four directions 
(two zonal and two meridional) over the globe. Thin and thick lines of the same colour represent 
monthly values and running annual averages (right aligned), respectively. (Data source: ERA5 
reanalysis data retrieved from Climexp.) 


11.2 Do background conditions favour enhancement of hydroclimatic 
extremes? 


If we focus on storms and floods and consider the wind regime as a background condition 
that influences them, then, as we have already seen, there are no changes that would seem 
to favour intensification. Another possible background condition affecting precipitation 
extremes is atmospheric moisture. An extensive study thereof has been recently 
presented in Koutsoyiannis (2020b), from which we reproduce here a small part of the 
results, referring to the water vapour amount (also known as vertically integrated water 
vapour, or precipitable water* and expressed in mm or equivalently kg/m²). This is 
estimated by radiosonde data of temperature and relative humidity on a local basis, but 
on a global basis, which is of interest here, it can be either estimated by reanalysis data or 
provided by satellite data. In all cases, the water vapour amount is the mass of water 
vapour, integrated over the entire atmosphere, per unit area. An increased water vapour 
amount could potentially lead to increased storm severity. 

However, the data plotted in Figure 11.7 do not support such a case. Specifically, the 
data originating from two reanalyses, ERA5 and NCEP-NCAR†, which agree impressively 
well with each other, indicate fluctuation over time, with no monotonic trend. One of the 
two satellite data sets (NVAP) also agrees on the average, indicating no trend. However, 
the most recent satellite data set (MODIS) suggests a decreasing trend, just the opposite 
of the IPCC predictions. In addition, Figure 11.8, which provides layered information for 
the MODIS data, shows that the decreasing trend is more pronounced in the upper 
atmospheric levels (440 to 10 hPa). 


* The adjective precipitable for the water vapour amount is a misnomer: if the total water vapour amount in 
the atmosphere was indeed to precipitate in its entirety, this would violate the laws of thermodynamics. 


† The NCEP-NCAR reanalysis is jointly produced by the National Centers for Environmental Prediction 
(NCEP) and the National Center for Atmospheric Research (NCAR). Its temporal coverage includes 4-times 
daily, daily and monthly values for 1948 to present at a horizontal resolution of 1.88° (~ 210 km). It uses a 
state-of-the-art analysis/forecast system to perform data assimilation using observations and a numerical 
weather prediction model. The data assimilation and the model used are identical to the global system 
implemented operationally at NCEP except in the horizontal resolution. 

Figure 11.7 Variation of water vapour amount over the globe and the land and sea areas. Thin 
and thick lines of the same colour represent monthly values and running annual averages (right 
aligned), respectively. (Source: Koutsoyiannis, 2020b; NCEP-NCAR: reanalysis data; ERA5: 
reanalysis data; NVAP: satellite data from the NASA Pathfinder project; MODIS: satellite data, 
averages from the MODIS-Terra & MODIS-Aqua satellites.) 
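
The "running annual averages (right aligned)" in the caption can be sketched as follows, assuming that "right aligned" means each average covers the 12 months ending at the month where it is plotted. The function name and the toy series are hypothetical and serve only to show the alignment. This is not the code used to produce the figure.

```python
def running_mean_right_aligned(values, window=12):
    """Running mean over the trailing `window` values.

    The result at index i averages values[i - window + 1 : i + 1], so it is
    aligned with the right (most recent) end of the window. Positions with
    fewer than `window` preceding values are returned as None.
    """
    out = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the value leaving the window
        out.append(running_sum / window if i >= window - 1 else None)
    return out


# Toy monthly series covering two years (values 1..24), for illustration only.
monthly = list(range(1, 25))
annual = running_mean_right_aligned(monthly)
print(annual[10], annual[11], annual[23])
```

With this alignment the first annual average appears at the twelfth month, and each later value reflects only the current and preceding months.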


In other words, as far as hydrological extremes are concerned, observations do not 
show any changes in the background conditions that would favour occurrence of more 
frequent or more intense extremes. 

Figure 11.8 Variation of water vapour amount as in Figure 11.7 but only for the MODIS satellite 
data set and separately for its two platforms, Terra and Aqua: (left) total of the vertical column; 
(middle) from surface to 680 hPa; (right) from 440 to 10 hPa. Thin and thick lines of the same 
colour represent monthly values and running annual averages (right aligned), respectively. 
(Source: Koutsoyiannis, 2020b.) 


11.3 Is the risk from hydroclimatic extremes increasing? 


If we try to approach changes in the risk from extremes, including the influence of 
exposure and vulnerability, the ultimate measure of risk is the number of deaths from 
natural disasters. Relevant data are shown in Figure 11.9 for all natural disasters 
classified into five categories, three of which are of hydroclimatic type. 

Figure 11.9 Evolution of the frequency of deaths from natural disasters per decade in the 20th and 
21st centuries. In addition to deaths from floods and droughts, deaths from other categories of 
natural catastrophes are also plotted: “extreme weather” includes storms, extreme temperatures 
(cold- or heatwave, severe winter conditions) and fog; “earthquake” also includes tsunamis; 
“other” comprises landslides (wet or dry), rockfalls, volcanic activity (ash fall, lahar, pyroclastic 
flow and lava flow) and wildfires. (Source: Koutsoyiannis, 2020b; victim data: OFDA/CRED 
International Disaster Database*; population data: United States Census†.) 

* https://ourworldindata.org/ofdacred-international-disaster-data 

† https://www.census.gov/data-tools/demo/idb/informationGateway.php 

Clearly, the impacts of hydroclimatic disasters, particularly the severest of them, which 
caused human losses, have spectacularly dropped since the beginning of the 20th century. 
The victims from these categories of disasters have diminished, while other types of 
disasters still cause large numbers of victims. Thus, in the 2010s the primary cause is 
earthquakes, representing 59% of the total number of victims. Obviously, the reason 
behind this decline is not that floods and droughts have become less severe or less 
frequent. Rather, it is the improvement of technology and of risk assessment, management 
and reduction, along with the strengthening of international collaboration and the 
economy, so that the advances could actually be implemented. 

Interestingly, according to data of 2010-2017, the deaths from natural disasters 
represent 0.08% of the total number of deaths, as seen in Figure 11.10. This number ranks 
them in the last position in Figure 11.10, with the penultimate cause being cold and heat. 
Deaths from cold and heat are registered together, while a multi-country analysis by 
Gasparrini (2015) suggests that these are mostly (at 95%) due to cold. For comparison, 
the contribution to deaths of respiratory diseases (belonging in the broader category of 
health issues) is 11.6%, about 150 times higher than that of natural disasters (and, apparently, 
this figure should have now increased due to Covid-19). The curious reader is 
encouraged to try to trace the reasons why the general perception of the public, informed 
by the media, is inverse to reality, and also why the climate-related risks, the least severe of 
all, have been promoted so enormously by international organizations and 
philanthropists. 
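
As a quick check of the "about 150 times" figure, using only the two shares quoted above (respiratory diseases and natural disasters):

```latex
\frac{11.6\%}{0.08\%} = 145 \approx 150
```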

Figure 11.10 Average share of deaths per cause in the 2010s. Data from the database of Our 
World in Data*; note that the total is slightly greater than 100% (101.4%, perhaps suggesting that 
in some of the cases there are two causes). 

* https://ourworldindata.org/grapher/share-of-deaths-by-cause 

11.4 Gazing the future 


The enormous promotion of climate-related risks has been accompanied by the 
development of a paradigm of prophecy for the future of the planet and of humanity, 
based on models. There is no parsimony in the time horizon of such prophecies, which 
can reach the year 100 000 AD (Shaffer et al., 2009) or even 1 million years (Archer et al., 
2020). 

The prophetic approach is also quite pessimistic, generally predicting future disasters, 
more recently despising science and technology, if not attempting to deprive mankind of 
them, like in Aeschylus’s extract from Prometheus Bound, which appears as a motto in the 
beginning of the book. 

The book supported the more traditional historical approach, which is also stochastic, 
both in the modern and the ancient meaning of the term (cf. the quotation by Basilius 
Caesariensis in Digression 1.A). We use the scientific method to reveal hidden secrets of 
the past and quantify the evolution of natural processes. We use stochastics to describe 
that evolution in the past and to make induction for the future. 

History teaches us that technology has substantially contributed to risk reduction, to 
the quality and length of human life, and to human life as a value. It can thus further 
improve the present. By improving the present, using lessons from the past, we might 
develop an optimistic vision for the future—and indeed, the information presented in this 
epilogue allows it. 

Stochastics of Hydroclimatic Extremes is a real monument in stochastics! It is a summary of the 
lifetime dedication by Demetris Koutsoyiannis to the science of environmental extremes, it is a 
demonstration of the value of stochastics itself to gain a better understanding of why and how 
extremes happen. The perspective adopted in the book is that of a scientist who is able to cross and 
transform disciplines by proposing an innovative synthesis of knowledge. This book is indeed 
presenting new concepts, new theoretical interpretations and new opportunities for engineering 
design, for the sake of mitigating the impact of extremes and adapting modern society to 
environmental variability. 


It is fascinating that the book is self-produced and openly available to readers. Like any self-produced 
creation of the humankind, this book has a unique and independent history that is rooted in the 
intimate personality of the author. It is a creation that does not require to adhere to any format 
other than those suggested by the author’s vision and creativity. For this reason, its value is 
incommensurably high, it is a real Cool Look at Risk as Demetris says. 


I believe time will highlight Stochastics of Hydroclimatic Extremes as a transforming masterpiece 
which will bring illuminating ideas to the reader. 
Alberto Montanari 


Head of the Dept. of Civil, Chemical, Environmental, and Materials Engineering, University of Bologna 
President of the European Geosciences Union 


This is a book that could not only transform your career, but also the entire fields of environmental 
statistics and stochastic hydrology. This seminal contribution is not like other books you have read 
which tend to summarize existing knowledge. Rather, it condenses existing knowledge in short order 
and spends nearly all its time on new knowledge, much of it never before published, communicating 
effectively both the theoretical and practical aspects of analysis of a wide range of hydroclimatic 
extremes. The style of presentation itself is novel and compelling, so that I could not resist reading it 
from cover to cover. 


If you think you understand how to apply probability and statistics to predict future extreme events, 
think again, because very quickly you will be convinced that extremes arise from spatial and temporal 
stochastic processes, and are neither independent nor identically distributed (iid) events, nor do 
most of our common probability distributions used for flood and drought frequency analysis capture 
the type of thick tails which are so convincingly documented in this book. 


I predict that many of the novel concepts, examples and techniques introduced here, many for the 
first time, will find their way into widespread acceptance in hydroclimatology, over time. Foremost, 
the reader will appreciate the value of viewing extreme events as realizations of stochastic processes 
rather than a series of iid annual maxima/minima. The climacogram provides a new window into the 
structure of stochastic processes and may be more fundamental than the correlogram. I can’t wait to 
test out the so-called Pareto-Burr-Feller distribution and the novel knowable moments (K-moments) 
which appear to have clear advantages over ordinary moments for describing distribution tails. 


It is remarkable that after a long career in hydrology, after reading this book, I gained many new 
insights into common statistical methods as well as new methods documented here for the first time. 
How I wish my career were just beginning, and thus could have applied all the wonderful ideas and 
methods in this book during my career. This is literally a treasure for young scholars interested in the 
probabilistic behaviour of hydroclimatic extremes. 

Richard M. Vogel 


Professor Emeritus and Research Professor, Dept. Civil and Environmental Engineering, Tufts University 